Diagnostic and Prognostic Tests

ABSTRACT

The invention provides methods for diagnosing biological states or conditions based on ratios of gene expression data from tissue samples, such as cancer tissue samples. The invention also provides sets of genes that are expressed differentially in malignant pleural mesothelioma. These sets of genes can be used to discriminate between normal and malignant tissues, and between classes of malignant tissues. Accordingly, diagnostic assays for classification of tumors, prediction of tumor outcome, selecting and monitoring treatment regimens and monitoring tumor progression/regression also are provided.

RELATED APPLICATIONS

This application claims the benefit under 35 U.S.C. § 119(e) of U.S. provisional application 60/317,389, filed Sep. 5, 2001, and U.S. provisional application 60/______, filed Aug. 30, 2002, the entire disclosures of which are incorporated herein by reference.

GOVERNMENT SUPPORT

This invention was made in part with government support under grant number DK58849 from the National Institutes of Health. The United States government may have certain rights in this invention.

FIELD OF THE INVENTION

The invention relates to methods for diagnosing conditions, predicting prognoses and optimizing treatment strategies using ratios of gene expression data. The invention also relates to nucleic acid markers for cancer, particularly for distinguishing malignant pleural mesothelioma from other lung cancers or from normal lung tissue, and for distinguishing between subclasses of malignant pleural mesothelioma.

BACKGROUND OF THE INVENTION

Although much progress has been made toward understanding the biological basis of cancer and in its diagnosis and treatment, it is still one of the leading causes of death in the United States. Inherent difficulties in the diagnosis and treatment of cancer include, among other things, the existence of many different subgroups of cancer and the concomitant variation in appropriate treatment strategies to maximize the likelihood of positive patient outcome.

Subclassification of cancer has typically relied on the grouping of tumors based on tissue of origin, histology, cytogenetics, immunohistochemistry, and known biological behavior. The pathologic diagnosis used to classify the tumor taken together with the stage of the cancer is then used to predict prognosis and direct therapy. However, current methods of cancer classification and staging are not completely reliable.

Gene expression profiling using microarrays is likely to result in improvements in cancer classification and prediction of prognosis (Golub, 1999; Perou, 2000; Hedenfalk, 2001; Khan, 2001). Still, the wealth of information garnered using microarrays has, thus far, not yielded effective clinical applications. Global expression analysis has led to the development of sophisticated computer algorithms seeking to extend data analysis beyond simple expression profiles (Quackenbush, 2001; Khan, 2001). At this time, however, no clear consensus exists regarding which computational tools are optimal for the analysis of large gene expression profiling data sets, particularly in the clinical setting. Moreover, many of these bioinformatics tools under development and testing are quite complex, leaving the practical use of microarray data beyond the scope of many biomedical scientists and/or clinicians. With rare exceptions (e.g., PSA and prostate cancer), it is generally assumed that expression levels of any one gene are insufficient in the diagnosis and/or prognosis of cancer. However, it is equally erroneous to assume a priori that the expression profiles of large numbers of genes are explicitly required for this purpose.

It is difficult to predict from standard clinical and pathologic features the clinical course of cancer. However, it is very important in the treatment of cancer to select and implement an appropriate combination of therapeutic approaches. The available methods for designing strategies for treating cancer patients are complex and time consuming. The wide range of cancer subgroups and variations in disease progression limit the predictive ability of the healthcare professional. In addition, continuing development of novel treatment strategies and therapeutics will result in the addition of more variables to the already complex decision-making process involving matching the cancer patient with a treatment regimen that is appropriate and optimized for the cancer stage, tumor growth rate, and other factors central to the individual patient's prognosis. Because of the critical importance of selecting appropriate treatment regimens for cancer patients, the development of guidelines for treatment selection is of key interest to those in the medical community and their patients. Thus, there presently is a need for objective, reproducible, and sensitive methods for diagnosing cancer, predicting cancer patient prognosis and outcome, and selecting and monitoring optimal treatment regimens.

SUMMARY OF THE INVENTION

Using focused microarray-based expression profiling, a simple method was developed to diagnose and predict outcome in patients with malignant pleural mesothelioma (MPM). MPM is a mesodermally derived, neoplastic disease that arises in the pleura and relentlessly grows into adjacent structures until it ultimately results in the death of the patient. There are three distinct histological subtypes of MPM: epithelial, mixed, and sarcomatoid (Corson, 1996). Tumor specimens that are linked to a comprehensive clinical database were utilized to correlate gene expression data directly with clinical variables such as survival, and to develop and test novel prognostic and diagnostic tests for MPM and other cancers. Additional tests have proven the applicability to cancers other than MPM, including lung adenocarcinoma, squamous carcinoma, medulloblastoma, prostate cancer, breast cancer, ovarian cancer, leukemias and lymphomas.

The diagnostic and prognostic methods that were developed utilize gene expression data from as few as two genes through the use of expression level ratios and rationally chosen thresholds. The effectiveness of unit-less ratios in diagnosing cancer types was demonstrated and confirmed using real time quantitative reverse-transcriptase polymerase chain reaction (RT-PCR). This is a simple, but powerful, use of microarray data that can be easily adapted to a clinical setting to diagnose cancer (and non-cancer tissue or diseases) and predict patient outcome without complex computer software or hardware. Accordingly, diagnostic assays for classification of tumors, prediction of tumor outcome, selecting and monitoring treatment regimens, and monitoring tumor progression/regression can now be based on the ratios of expression of a small number of genes.

The gene expression ratio concept can be applied to other tissues to diagnose or distinguish between tissues in different biological states, such as tissues from subjects having disease and not having disease, subjects that vary in response to pharmaceuticals or that metabolize pharmaceuticals at different rates, subjects that vary in disease susceptibility or predisposition, and the like. Thus a subject's prognosis or response to treatments, inter alia, can be determined through analysis of a limited set of genes in particular biological samples. Moreover, the gene expression data can be obtained from, and comparisons can be made between, a number of different methods including nucleic acid hybridization (e.g., microarrays) and nucleic acid amplification methods (e.g., RT-PCR).

According to one aspect of the invention, methods for diagnosing the presence of tissue in a first biological state, preferably cancer cells, in a tissue sample are provided. The methods include providing a set of two or more genes, wherein the set comprises at least one upregulated gene that is expressed in greater amounts in a tissue in a first biological state (preferably cancer cells) than in a second biological state (preferably corresponding non-cancer cells) and at least one downregulated gene that is expressed in lesser amounts in a tissue in the first biological state (preferably cancer cells) than in the second biological state (preferably corresponding non-cancer cells). The methods also include determining the expression levels of the set of two or more genes, and calculating a ratio of the expression level of the upregulated gene to the expression level of the downregulated gene, wherein the ratio is indicative of the presence of tissue in the first biological state (preferably cancer cells) in the tissue sample. Another preferred diagnostic use for the method is to identify non-cancer tissues or diseases.

In certain preferred embodiments, there is at least a 2-fold difference in mean expression levels between the at least one upregulated gene and the at least one downregulated gene. In other preferred embodiments, two or more expression ratios are calculated. In certain embodiments, the two or more expression ratios are combined, preferably by calculating the geometric mean of the two or more expression ratios.
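The ratio calculation and the geometric-mean combination described above can be illustrated with a minimal Python sketch. The function names, the default threshold of 1.0, and the state labels are assumptions made for illustration only; the text specifies only that ratios are compared against rationally chosen thresholds.

```python
import math


def expression_ratio(up_level, down_level):
    """Ratio of one upregulated gene's level to one downregulated gene's level."""
    if up_level <= 0 or down_level <= 0:
        raise ValueError("expression levels must be positive")
    return up_level / down_level


def has_min_fold_difference(mean_up, mean_down, fold=2.0):
    """Check that two mean expression levels differ by at least `fold`."""
    high, low = max(mean_up, mean_down), min(mean_up, mean_down)
    return high / low >= fold


def geometric_mean(ratios):
    """Combine two or more ratios; computed in log space for numerical stability."""
    if not ratios:
        raise ValueError("at least one ratio is required")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def call_from_ratios(ratios, threshold=1.0):
    """Hypothetical decision rule: the threshold of 1.0 is an assumed placeholder."""
    combined = geometric_mean(ratios)
    return "first state" if combined > threshold else "second state"
```

With two ratios of 2.0 and 8.0, the geometric mean is 4.0, so the sketch reports the first biological state under the assumed threshold.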

In certain embodiments, the ratio is calculated by division of the expression level of one upregulated gene by the expression level of one downregulated gene, or by division of the expression levels of two or more upregulated genes by the expression level of one downregulated gene, or by division of the expression level of one upregulated gene by the expression levels of two or more downregulated genes, or by division of the expression levels of two or more upregulated genes by the expression levels of two or more downregulated genes.

In other embodiments, the methods also include transforming the expression level data for the upregulated and/or downregulated genes prior to calculating the ratio.
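The text does not say which transformation is applied. As one possible illustration, the following Python sketch assumes a base-2 logarithm with a lower floor on the raw values; both the choice of log2 and the floor of 1.0 are assumptions, not part of the described method. In log space the ratio of two levels becomes a difference.

```python
import math


def log2_transform(level, floor=1.0):
    """Log-transform a raw expression level; values below `floor` are raised to it.

    The floor (an assumed value) avoids taking the log of zero or negative signals.
    """
    return math.log2(max(level, floor))


def log_ratio(up_level, down_level, floor=1.0):
    """Log2 of the up/down ratio, computed as a difference of transformed levels."""
    return log2_transform(up_level, floor) - log2_transform(down_level, floor)
```

Under these assumptions, a log ratio above zero corresponds to an untransformed ratio above 1.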

In still other embodiments, the expression levels are determined by a method selected from the group consisting of nucleic acid hybridization and nucleic acid amplification. In preferred embodiments, the nucleic acid hybridization is performed using a solid-phase nucleic acid molecule array. In other preferred embodiments, the nucleic acid amplification method is real-time PCR.

In yet other embodiments, the expression levels are determined by an immunological method, preferably using a solid-phase antibody array, an ELISA or ELISPOT assay.

According to preferred embodiments of the foregoing methods, the cancer is selected from the group consisting of malignant pleural mesothelioma, lung adenocarcinoma, squamous carcinoma, medulloblastoma, prostate cancer, breast cancer, diffuse large B-cell lymphoma, follicular lymphoma and ovarian cancer.

In certain embodiments, the at least one ratio is indicative of the presence of cancer cells in the tissue sample. In other embodiments, the at least one ratio is indicative of the presence of non-cancer cells in the tissue sample.

Methods similar to those described above, for determining prognosis of a cancer patient, are also provided according to the invention.

According to another aspect of the invention, kits for cancer diagnosis are provided. The kits include a set of one or more ratios applicable to the analysis of gene expression data, wherein the ratio is calculated from the expression levels of at least one upregulated gene that is expressed in greater amounts in the cancer cells than in corresponding non-cancer cells and at least one downregulated gene that is expressed in lesser amounts in cancer cells than in corresponding non-cancer cells. In certain embodiments, the kit also includes instructions for the use of the one or more ratios in the diagnosis of the presence of cancer cells in a biological sample.

According to a further aspect of the invention, diagnostic systems are provided. The diagnostic systems include a measurement device that measures gene expression level data of a set of two or more genes, wherein the set comprises at least one upregulated gene that is expressed in greater amounts in a tissue in a first biological state (preferably cancer cells) than in a second biological state (preferably corresponding non-cancer cells) and at least one downregulated gene that is expressed in lesser amounts in the tissue in the first biological state (preferably cancer cells) than in the second biological state (preferably corresponding non-cancer cells). The system also includes a data transformation device that acquires the gene expression data from the measurement device and performs data transformation to calculate a ratio of the gene expression levels of the upregulated and downregulated genes.

In certain embodiments, the data transformation device selects gene expression data of a selected set of genes from the measurement device for calculating the ratio of the selected set of genes, wherein the ratio calculated from the gene expression data of the selected set of genes is diagnostic for a selected biological state, such as a condition, preferably cancer.

In other embodiments, the cancer diagnostic system also includes a user interface output device to output the ratio to a user. In preferred embodiments, the cancer diagnostic system also includes a database of ratios of gene expression that are diagnostic for cancers, and a comparison device that compares the ratio calculated from the measured gene expression to the diagnostic ratios stored in the database and outputs the comparison to the user interface output device. In other preferred embodiments, the cancer diagnostic system also includes a database of treatment information for specific cancers, wherein the comparison device identifies treatment information in the database for the specific cancer for which the diagnostic ratio matches the calculated ratio, and wherein the comparison device outputs the treatment information to the user interface output device.
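The comparison step of such a system might be sketched in software as follows. This is an illustrative Python outline only: the record layout, the gene identifiers GENE_A and GENE_B, the threshold value, and the treatment note are hypothetical placeholders, and a "match" is assumed here to mean that the calculated ratio exceeds a stored threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnosticRatio:
    """One database entry: a gene pair, a threshold, and the diagnosis it supports."""
    up_gene: str
    down_gene: str
    threshold: float
    diagnosis: str


# Hypothetical database contents, for illustration only.
RATIO_DATABASE = [DiagnosticRatio("GENE_A", "GENE_B", 1.0, "condition X")]
TREATMENT_DATABASE = {"condition X": "placeholder treatment note"}


def compare(levels, ratio_db=RATIO_DATABASE, treatment_db=TREATMENT_DATABASE):
    """Return (diagnosis, calculated ratio, treatment info) for each matching entry."""
    results = []
    for entry in ratio_db:
        if entry.up_gene not in levels or entry.down_gene not in levels:
            continue  # the measurement device did not report this gene pair
        ratio = levels[entry.up_gene] / levels[entry.down_gene]
        if ratio > entry.threshold:
            results.append((entry.diagnosis, ratio, treatment_db.get(entry.diagnosis)))
    return results
```

The returned list stands in for what the comparison device would pass to the user interface output device.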

According to yet another aspect of the invention, methods for diagnosing malignant pleural mesothelioma in a subject suspected of having malignant pleural mesothelioma are provided. The methods include obtaining from the subject a tissue sample suspected of being cancerous, and determining the expression of a set of nucleic acid molecules or expression products thereof in the tissue sample, wherein the set of nucleic acid molecules includes at least two nucleic acid molecules selected from the group consisting of SEQ ID NOs:9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77. Preferably the set of nucleic acids includes at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 or 25 nucleic acid molecules selected from the group consisting of SEQ ID NOs:9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77.

In certain embodiments, the methods include determining the expression of the set of nucleic acid molecules or expression products thereof in a non-cancerous tissue sample, and comparing the expression of the set of nucleic acid molecules or expression products thereof in the tissue sample suspected of being cancerous and the non-cancerous tissue sample. In other embodiments, the methods include calculating a ratio of the expression of at least two genes among the set of nucleic acid molecules.

Methods for selecting a course of treatment of a subject having or suspected of having malignant pleural mesothelioma are provided in another aspect of the invention. The methods include obtaining from the subject a tissue sample suspected of being cancerous, determining the expression of a set of nucleic acid markers or expression products thereof which are differentially expressed in malignant pleural mesothelioma tumor tissue samples, and selecting a course of treatment appropriate to the malignant pleural mesothelioma of the subject. In some embodiments the methods also include calculating a ratio of the expression of at least two genes among the set of nucleic acid markers or expression products thereof. In further embodiments, the methods include determining the expression of the set of nucleic acid molecules or expression products thereof in a non-cancerous tissue sample.

In preferred embodiments, the expression of a set of nucleic acid markers is determined by a method selected from the group consisting of nucleic acid hybridization and nucleic acid amplification. More preferably, the nucleic acid hybridization is performed using a solid-phase nucleic acid molecule array, and the nucleic acid amplification method is real-time PCR.

In another aspect of the invention, methods for evaluating treatment of malignant pleural mesothelioma are provided. The methods include obtaining a first determination of the expression of a set of nucleic acid molecules, or expression products thereof, which are differentially expressed in a malignant pleural mesothelioma tumor tissue sample from a subject undergoing treatment for cancer, obtaining a second determination of the expression of the set of nucleic acid molecules, or expression products thereof, in a second malignant pleural mesothelioma tumor tissue sample from the subject after obtaining the first determination, and comparing the first determination of expression to the second determination of expression as an indication of evaluation of the treatment.

In some embodiments, the determinations of expression are used to calculate a ratio of gene expression. In other embodiments, the methods include determining the expression of a set of nucleic acid markers which are differentially expressed in non-cancerous tissue samples.

In preferred embodiments, the expression of a set of nucleic acid markers is determined by a method selected from the group consisting of nucleic acid hybridization and nucleic acid amplification. Preferably, the nucleic acid hybridization is performed using a solid-phase nucleic acid molecule array and the nucleic acid amplification method is real-time PCR.

According to a further aspect of the invention, a solid-phase nucleic acid molecule array is provided which consists essentially of at least two nucleic acid molecules selected from the group consisting of SEQ ID NOs:9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77 fixed to a solid substrate. In some embodiments, the solid-phase nucleic acid molecule array also includes at least one control nucleic acid molecule.

In preferred embodiments, the set of nucleic acid molecules comprises at least 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, or 25 nucleic acid molecules selected from the group consisting of SEQ ID NOs:9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77.

In certain embodiments, the solid substrate comprises a material selected from the group consisting of glass, silica, aluminosilicates, borosilicates, metal oxides such as alumina and nickel oxide, various clays, nitrocellulose, or nylon. In other embodiments, the nucleic acid molecules are fixed to the solid substrate by covalent bonding.

According to still another aspect of the invention, solid-phase protein microarrays are provided that include at least two antibodies or antigen-binding fragments thereof, that specifically bind at least two different polypeptides selected from the group consisting of SEQ ID NOs:10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, fixed to a solid substrate.

In some embodiments, the microarray further comprises an antibody or antigen-binding fragment thereof, that binds specifically to a cancer-associated polypeptide other than those selected from the group consisting of SEQ ID NOs:10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78. In other embodiments, the protein microarray also includes at least one control polypeptide molecule.

In preferred embodiments, the antibodies are monoclonal antibodies or polyclonal antibodies.

Methods for identifying lead compounds for a pharmacological agent useful in the treatment of malignant pleural mesothelioma are provided in another aspect of the invention. The methods include contacting a malignant pleural mesothelioma cell or tissue with a candidate pharmacological agent, determining the expression of a set of nucleic acid molecules in the malignant pleural mesothelioma cell or tissue sample under conditions which, in the absence of the candidate pharmacological agent, permit a first amount of expression of the set of nucleic acid molecules, wherein the set of nucleic acid molecules comprises at least two nucleic acid molecules selected from the group consisting of SEQ ID NOs:9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, and detecting a test amount of the expression of the set of nucleic acid molecules, wherein a decrease in the test amount of expression in the presence of the candidate pharmacological agent relative to the first amount of expression indicates that the candidate pharmacological agent is a lead compound for a pharmacological agent which is useful in the treatment of malignant pleural mesothelioma. In preferred embodiments, the methods also include calculating a ratio of gene expression.

These and other aspects of the invention will be described in greater detail below.

BRIEF DESCRIPTION OF THE DRAWING

FIG. 1 shows tumor diagnosis using expression ratios. FIG. 1A, patterns of relative expression levels for the 8 genes selected from the training set can be extended to the remaining samples. Relative expression levels increase from low to high per legend. FIG. 1B, graphic depiction of the magnitude and direction, in all 149 samples comprising the test set, of the value for two independent ratios (calretinin/claudin-7 and VAC-β/TACSTD1) chosen for further study. FIG. 1C, the 8 individual samples (represented by colored bars) that were misdiagnosed using one ratio or the other from FIG. 1B (blue bars for misdiagnosed MPM samples, red bars for misdiagnosed ADCA samples).

FIG. 2 depicts validation of microarray data and ratio-based diagnosis. Quantitative RT-PCR was used to obtain ratio values for 12 MPM and 12 ADCA tumors. In this case, the two ratios correctly identified 23/24 samples with one no-call.

FIG. 3 shows Kaplan-Meier survival predictions for medulloblastoma patients. Overall survival for patients predicted to be treatment responders (top line) and treatment failures (bottom line) using a 6-gene (5-ratio) model in a test set of samples (n=40). Hash marks indicate censored data.

FIG. 4 shows the validation of microarray-based analysis of gene expression using real time quantitative RT-PCR. FIG. 4A shows that the average expression levels of CFB, transgelin, and fibronectin are significantly (P<0.05) different in tumor samples from Subclass 1 and Subclass 2. FIG. 4B shows that the expression level ratios remain consistent in distinguishing epithelial tumor samples from all others using data obtained from either microarray analysis or RT-PCR. Ratio represents the average gene expression level in epithelial subtype tumors relative to the average expression level of all other tumors combined. Error bars, SEM; M, data from microarray analysis; RT-PCR, data from quantitative RT-PCR analysis.

FIG. 5 depicts prediction of outcome in MPM using expression ratios or tumor histology. FIG. 5A, survival of 31 MPM patients whose outcome was predicted using a 4-gene expression ratio model. FIG. 5B, survival of the 31 MPM patient samples from FIG. 5A plus 5 additional samples (36 total) as a function of tumor histological subtype. Prediction of outcome using the geometric mean value of 3 expression ratios is more accurate than the use of histological appearance alone at identifying patients with widely divergent outcome (FIG. 5A). Although patients with epithelial histology tumors tend to survive longer, predicting prognosis in this manner is highly inexact for any individual patient (FIG. 5B). Each data point represents a single sample. Circles enclose tumor samples from patients with survival at or near the median for MPM. Horizontal bars depict median survival for each group. *, geometric mean calculated from the 3 most accurate expression ratios used to predict outcome (using data from a total of 4 genes).

FIG. 6 shows Kaplan-Meier survival predictions for mesothelioma patients and verification of microarray data. FIG. 6A, overall survival for all 31 patients from which the training set was chosen. The estimated median survival for the entire cohort was 11 months. FIG. 6B, overall survival based on the histological subtype of the tumor. The top line represents epithelial subtype tumors (median survival=17 months) and the bottom line represents non-epithelial subtype tumors (median survival=8.5 months). Although epithelial subtype tumors tend to favor longer survival, prediction of outcome in this manner is highly inexact and not accurate for individual samples. FIG. 6C, geometric mean values obtained for 6 randomly chosen samples (3 each from good and poor outcome groups) using quantitative RT-PCR confirmed microarray data (M).

FIG. 7 depicts independent validation of the 4-gene expression ratio model. FIG. 7A, overall survival for 29 independent mesothelioma patients. Similar to the initial 31 samples, the estimated median survival for this cohort was 12 months. FIG. 7B, overall survival based on the histological subtype of the tumor. The median survival of epithelial subtype tumors (top line, median survival=17 months) and non-epithelial subtype tumors (bottom line, median survival=12 months) in the new sample set was identical to that for the previous 31 samples and was equally insufficient for predicting outcome. FIG. 7C, overall survival in the new set of samples for good outcome (top line, median survival=36 months) and poor outcome (bottom line, median survival=7 months) groups as defined by the 4-gene expression ratio model (utilizing RT-PCR for data acquisition). The 4-gene expression ratio model significantly (P=0.0035) predicts outcome in mesothelioma in an independent set of 29 samples.

FIG. 8 shows Kaplan-Meier disease-free survival predictions for breast cancer patients. Time to relapse for patients predicted to be good prognosis (top line) and poor prognosis (bottom line) using a 6-ratio model in the test set of samples (n=19). Hash marks indicate censored data.

FIG. 9 shows Kaplan-Meier survival predictions of test set samples for adenocarcinoma patients as described in Example 5. Time to relapse for patients predicted to be good prognosis (top line) and poor prognosis (bottom line) using a 3-ratio model in the test set data of Bhattacharjee et al. Hash marks indicate censored data.

DETAILED DESCRIPTION OF THE INVENTION

Gene expression profiling using high density oligonucleotide arrays has figured prominently in recent studies using gene expression patterns in cancer to improve diagnosis and subclassification. Specifically, microarrays have been used to distinguish between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) (Golub, 1999), to explore molecular differences within the AML group of diseases (Virtaneva, 2001), to identify subclasses of breast cancer (Perou, 2000) and ovarian carcinoma (Welsh, 2001), and to define the metastatic phenotype of melanoma (Clark, 2000).

Although microarray-based analysis of gene expression in cancer has yielded a wealth of information, effective clinical applications have not followed for several reasons. There are no universally accepted and applicable computational methods to analyze microarray data (Quackenbush, 2001). Also, studies utilizing microarrays have lacked a comprehensive clinical database linking patient characteristics to their tumors' gene expression patterns. Furthermore, the prospect of having to use large numbers of genes to diagnose a disease subclass would require a relatively expensive analytical approach such as microarrays. Finally, sophisticated computer algorithms currently used for analysis of microarrays (Quackenbush, 2001; Khan, 2001) have placed the practical use of the resulting data beyond the reach of many biomedical scientists. These limitations were addressed using focused gene expression profiling of MPM in combination with an extensive clinical database to create an unexpectedly simple and effective ratio method with general clinical applicability (i.e., for cancers beyond MPM) for performing relatively low cost diagnosis and prediction of prognosis in cancer.

In contrast to many microarray-based studies seeking to compare gene expression patterns between two or more predefined groups, unsupervised clustering was first used for class discovery in MPM. In this way, the introduction of experimental bias that follows from assuming that tumors of the same histological subtype necessarily possess similar gene expression profiles was avoided. By extension, prognostic genes were identified based on differential expression levels between tumors that were members of the two subclasses with the best and worst prognoses, and not based simply on tumor histology. The fact that the prognostic genes so identified also distinguish epithelial tumors is coincidental, though not surprising since patients with epithelial subtype tumors tend to survive longer than those with mixed subtype tumors.

Subclassification using unsupervised clustering also presents a more biologically relevant organization. It has been shown that similar tumor appearance in itself does not necessitate similar patterns of gene expression or similar final clinical outcome. For example, it is not unusual for patients with lung cancers of identical histology, differentiation, location, and stage to have diverging survival (Mountain, 1997). In the experiments described herein, one subclass contained tumors of all three major histological subtypes, suggesting (i) that tumors of diverse appearance are more similar than originally thought, (ii) that all subtypes of MPM are correctly classified as a single disease, and (iii) that histology alone is not sufficient to determine prognosis.

Patient outcome depends on the phenotype of individual tumors at the molecular level, and this is reflected directly in gene expression. The recent explosion of bioinformatics has facilitated exploration of complex patterns of gene expression in human tissues (Fodor, 1997). However, exact relationships between gene expression patterns in cancer and clinical data remain largely undefined. Sophisticated computer algorithms capable of molecular diagnosis of tumors using the immense data sets generated by expression profiling have recently been developed (Khan, 2001). Though these techniques are valid, their widespread clinical applicability in the foreseeable future is questionable. The study described herein shows that diagnosis and prognosis of cancer using data originally obtained from microarrays is not explicitly dependent on the use of increasingly complex technology or complicated methods.

Microarrays themselves are evolving at a rapid pace and gene expression analysis in this manner remains an expensive endeavor. Therefore, comparing historical data to that obtained from new generation microarrays remains a priority for most investigators. Yet there are no satisfactory solutions to date that adequately address all of the normalization issues encountered when attempting to merge data from older microarrays, or those from multiple manufacturers. Examination of ratios of gene expression, as described herein, as opposed to absolute expression levels, also assists in the practical use of data from the older generation of commercially obtained microarrays.

The invention described herein also relates to the identification of a set of genes expressed in cancer tissue that are diagnostic for the cancer and/or predictive of the clinical outcome of the cancer. In one aspect, ratios of gene expression are used as indicia of cancer type, cancer class, and/or cancer prognosis, all of which are useful for determining a course of treatment of a patient.

Changes in cell phenotype in cancer are often the result of one or more changes in the genome expression of the cell. Some genes are expressed in tumor cells, and not in normal cells. Other genes are expressed at higher or lower levels in cancer cells than in normal counterparts. In addition, certain genes are expressed at different levels in different subgroups of cancers, which have different prognoses and require different treatment regimens to optimize patient outcome. The differential expression of such genes can be examined by the assessment of nucleic acid or protein expression in the cancer tissue.

One of the recent developments in gene expression analysis involves the use of microarrays to measure simultaneously the expression of hundreds or thousands of genes. Practical application of this technology requires that researchers or laboratories have a sophisticated knowledge of molecular biology to generate gene expression data, and of computer algorithms for analysis of the large quantities of data generated by the use of the microarrays. The requirements for such knowledge make the use of microarrays impractical in the clinical setting, and difficult even for research laboratories. In addition, one must account for differences in microarray architecture, sample preparation, and analytical equipment that captures the signals from the microarrays.

The use of gene expression ratios in the diagnosis and prediction of prognosis in cancer overcomes several major obstacles to the clinical use of microarray data. The methodology described herein avoids the technical difficulties described above. It generates a simple numerical measure that can be used to predict various aspects of patient clinical data (such as histological subtype and survival) using a single patient biopsy sample. Since this non-linear function of gene expression is a unit-less number, expression levels can be measured using any reliable method such as quantitative RT-PCR or microarrays (nucleic acid or protein) regardless of the type of data capture equipment. Thus, the present invention permits the diagnosis of cancer by clinical laboratories using standard equipment without the requirement for sophisticated data analysis.
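
The unit-less property can be illustrated with a minimal sketch. It assumes, purely for illustration, that each measurement platform reports expression levels multiplied by an unknown platform-specific scale factor; real platforms may deviate from such a simple model. The expression values, platform names, and scale factors below are hypothetical and do not come from the Examples.

```python
def expression_ratio(x: float, y: float) -> float:
    """Return the unit-less ratio of two expression levels."""
    if y == 0:
        raise ValueError("denominator expression level must be non-zero")
    return x / y


# Hypothetical underlying expression levels for genes X and Y in one sample.
true_x, true_y = 800.0, 200.0

# Assumed platform-specific scale factors (illustrative only).
platform_scale = {"microarray_A": 1.0, "microarray_B": 3.5, "rt_pcr": 0.02}

# The scale factor multiplies both measurements, so it cancels in the ratio.
ratios = {
    name: expression_ratio(true_x * k, true_y * k)
    for name, k in platform_scale.items()
}

for name, value in ratios.items():
    print(f"{name}: ratio = {value:.2f}")
```

Under this assumption, the absolute readings differ widely between platforms while the ratio of the two genes stays the same.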

Importantly, the diagnostic/prognostic accuracy of ratios permits an earlier definitive diagnosis using initial biopsy samples and reveals important clues about anticipated patient outcome prior to the assignment of treatment strategies. Considering the clinical treatment of MPM, for example, an initial diagnosis is usually made for patients presenting with a malignant pleural effusion. Typically, this diagnosis is confirmed prior to subjecting patients to major surgical resections. Unfortunately, standard pathological techniques for diagnosis even at this point may be inadequate due to a lack of suitable quantities of tissue. As a consequence, the histological subtype of the tumor initially diagnosed may not always be the same as that conclusively determined at the time of surgery (samples analyzed in this study were obtained at surgery when a sufficient amount of tissue was available for a definitive pathological diagnosis). This makes it difficult, if not impossible, to stratify treatment based on histological subtyping by prevailing methods. Diagnosis of other cancers is hampered by similar problems. Ratios obtained using tumor tissues taken at the time of initial biopsy can provide a firm diagnosis, determine subclass, and predict outcome after therapy when current pathological techniques are insufficient.

The invention also provides a new, more powerful method of stratifying patients with MPM (and more generally, is applicable to other cancers and other biological states and conditions). It has been previously documented that patients with the epithelial subtype generally enjoy a better prognosis than patients with non-epithelial histology (regardless of treatment strategy) and benefit from aggressive surgical resection. However, this is not an all-inclusive phenomenon; some patients with non-epithelial histology enjoy a longer survival than those with epithelial histology. These factors make it difficult to design clinical studies to explore alternative treatment strategies based on histological subtype.

The results presented herein provide a basis for at least one rational explanation of the aforementioned phenomenon: within MPM, there are actually two classes of epithelial tumors and two classes of mixed tumors. A series of simple tests utilizing ratios of gene expression is proposed that can determine with a high degree of accuracy the correct tumor histological subtype/subclass, and the likely clinical outcome of the patient. This information can be produced from a small tissue biopsy and does not require major surgery. Such classification is useful in the development of meaningful clinical trials in MPM. It therefore can be hypothesized that patients found to have tumors representative of those in Subclass 2 (short-lived mixed subtype) are excellent candidates for neo-adjuvant chemotherapy protocols as they are unlikely to benefit from surgery, whereas patients in Subclasses 1 and 3 are more likely to enjoy long term survival after surgical therapy.

Expression ratios involving two genes that vary in expression between different sample types (e.g., cancer/non-cancer) were used to diagnose and predict prognosis in MPM. Diagnostic and/or prognostic genes in general can be initially identified from microarray analysis and then tested for clinical relevancy using simpler methods such as RT-PCR. To accomplish similar feats in other biological conditions or states, including other cancers, it may be necessary to use expression ratios including different mathematical combinations and/or more than two genes. The ratio concept described herein (e.g., for clinical use) is simply the relationship between the expression levels of multiple genes that vary in expression between two different sample types, i.e., samples that have different biological properties or were obtained from subjects having different phenotypes, such as cancer/non-cancer phenotypes, responsive/non-responsive to stimuli, susceptible/not susceptible to disease, different metabolic functions, etc. Non-linear unit-less ratios, in any form, can still remain simple if a relatively small number of genes are used in such a way as to not require complex computational software. Therefore, expression ratios of selected genes that vary in expression in two different biological samples may be used to translate complex data sets into simple tests that give clinically useful information for the diagnosis and prediction of prognosis of cancers.

Ratios of gene expression levels can be calculated from expression data of two or more genes at the mRNA level and/or protein level. Expression levels of two or more isoforms or variants of the same gene (e.g., splice variants or post-translationally modified variants) also can be used in the ratios. In contrast to prior methods for comparing gene expression, which compared the expression levels of genes relative to a gene having substantially unchanging expression (e.g., a housekeeping gene), the present method compares the expression of two or more genes that differ in expression between two (or more) biological states. Thus in a preferred embodiment, ratios are calculated from expression data of two or more genes, wherein one of the two or more genes is expressed at higher levels in a first biological state relative to a second biological state (upregulated in the first biological state), and a second of the two or more genes is expressed at lower levels in the first biological state relative to the second biological state (downregulated in the first biological state). Examples of this are demonstrated herein, wherein the expression levels of two or more genes that differ in expression in mesothelioma and normal tissue, or in subclasses of mesothelioma, are used to calculate ratios that effectively predict the phenotype of unknown tissue samples.

The ratios can be simple ratios (e.g., x/y) or more complex ratios that include mathematical manipulation of gene expression levels, for example, (x+a)/(y+b) or x³/y³, wherein x and y represent the expression level data for genes X and Y, and a and b can be either expression level data for genes A and B, or mathematical factors. The use of the ratios is not limited to one set of two genes. Additional sets of genes (two sets, three sets, or more sets) may be required to provide an optimally accurate diagnosis of certain biological states or conditions (e.g., cancers) based on the expression of certain sets of genes. Thus the methods are not limited to a ratio of two genes; a total of 4, 6, or more genes and various ratios of them may be used. Further transformation of the data in the form of multiple gene expression ratios also can be performed. In certain preferred embodiments, the geometric mean of multiple gene ratios is calculated. The expression data used to calculate the ratios may be obtained using any art-known method for analyzing gene expression including microarrays (e.g., standard or custom arrays; nucleic acid, protein or antibody arrays), quantitative RT-PCR, antibody or other immunoassay measurements, etc.
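
The ratio forms named above, and the geometric mean of several ratios, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the numerical inputs are arbitrary values rather than data from the Examples.

```python
import math
from typing import Sequence


def simple_ratio(x: float, y: float) -> float:
    """Simple ratio x/y of two expression levels."""
    return x / y


def offset_ratio(x: float, y: float, a: float, b: float) -> float:
    """Ratio (x+a)/(y+b); a and b may be expression levels or constants."""
    return (x + a) / (y + b)


def cubed_ratio(x: float, y: float) -> float:
    """Ratio of cubed expression levels, x^3 / y^3."""
    return x**3 / y**3


def geometric_mean(ratios: Sequence[float]) -> float:
    """Geometric mean of several positive gene expression ratios."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty sequence of positive values")
    # Average in log space, which avoids overflow for long products.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Arbitrary illustrative expression levels for genes X and Y.
x, y = 8.0, 2.0
print(simple_ratio(x, y))
print(offset_ratio(x, y, a=1.0, b=1.0))
print(cubed_ratio(x, y))
print(geometric_mean([4.0, 1.0, 2.0]))
```

Because each function needs only a handful of arithmetic operations, the calculation could equally be done by hand or in a spreadsheet, consistent with the aim of avoiding complex computational software.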

The ratios can be used to diagnose any condition having a genetic component in which two or more genes are differentially expressed in two or more biological states. Conditions include diseases, susceptibility to diseases, metabolic functions (e.g., variability in the metabolism of drugs), response to injury, responses to local cellular environments and the like. In preferred embodiments, the condition is a disease. For example, any diseases that are characterized by (1) the relative increase in the expression of a first gene in a first disease state, and (2) the relative increase in the expression of a second gene in a second disease state or nondisease state, can be diagnosed using ratios of gene expression. Preferred examples of such diseases are cancers, as demonstrated herein for malignant pleural mesothelioma. The ratios of gene expression also can be used to predict a condition outcome or condition prognosis, to monitor onset of a condition, to monitor treatment, and to select a course of treatment for a condition.

The gene expression data for calculation of the ratios may be obtained from analysis of biological samples including tissue, blood, urine, cerebrospinal fluid or other bodily fluids of a subject (e.g., humans or other animals). The expression data can be used without any transformation to calculate a simple ratio of two or more genes as exemplified in the Examples, or data transformation can be applied prior to, or as a part of, calculating the ratios.

The ratio calculation and/or data transformation can be performed by the device that captures the expression data (e.g., a device for performing real-time PCR or a microarray reader), or can be performed by a separate computer running appropriate software.

In certain embodiments, software for calculating ratios as described herein can be provided on a computer connected by data link to a data generating device, such as a microarray reader or PCR machine. Any standard data link can be used, including serial or parallel cables, radio frequency or infrared telemetry links, LAN connections, WAN connections, etc. Alternatively, data can be transferred by computer-readable medium (e.g., magnetic or optical medium) and read by the software. The data also can be entered directly by the user via a user interface, such as a keyboard, monitor, mouse, or graphical user interface such as a touch screen, etc. The computer may be contained within the data generating device, providing an integrated system for generating raw data, calculating ratios, and displaying such ratios. One or more computers also may be linked to one or more data generating devices and one or more display devices, such as in a local area network or wide area network.

After acquiring the raw gene expression data from the data generating device, the data for the variables examined can be used to calculate gene expression ratios in accordance with the methods of the invention. The software can allow the user to select a number of genes preferred for diagnosis or prognosis, or the software may calculate ratios for a standardized set or sets of genes (e.g., genes known to be useful for classification of a tissue type or set of tissue types). The software can execute data transformation algorithms from a preselected group, or can allow the user to input other algorithms. The ratio data can be stored in a data file, printed, and/or directly displayed to the user on a graphical user interface.
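
One way such software might be organized is sketched below, assuming the raw expression data have already been read into a mapping from gene name to expression level. The function names, gene labels (GENE_A, GENE_B, GENE_C), and the CSV layout are hypothetical choices made for illustration; they are not part of the disclosed system.

```python
import csv
import io
from typing import Dict, List, Sequence, Tuple


def compute_ratios(
    expression: Dict[str, float],
    gene_pairs: Sequence[Tuple[str, str]],
) -> List[Tuple[str, str, float]]:
    """Compute one ratio per user-selected (numerator, denominator) gene pair."""
    rows = []
    for numerator, denominator in gene_pairs:
        if numerator not in expression or denominator not in expression:
            raise KeyError(f"missing expression data for {numerator}/{denominator}")
        if expression[denominator] == 0:
            raise ValueError(f"zero expression level for {denominator}")
        rows.append(
            (numerator, denominator, expression[numerator] / expression[denominator])
        )
    return rows


def ratios_to_csv(rows: Sequence[Tuple[str, str, float]]) -> str:
    """Serialize the ratio data as CSV text suitable for storing in a data file."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["numerator_gene", "denominator_gene", "ratio"])
    for numerator, denominator, ratio in rows:
        writer.writerow([numerator, denominator, f"{ratio:.4f}"])
    return buffer.getvalue()


# Hypothetical raw data as it might arrive from a data generating device.
raw_expression = {"GENE_A": 900.0, "GENE_B": 300.0, "GENE_C": 50.0}
selected_pairs = [("GENE_A", "GENE_B"), ("GENE_C", "GENE_B")]

ratio_rows = compute_ratios(raw_expression, selected_pairs)
print(ratios_to_csv(ratio_rows))
```

The same text could be written to a file, printed, or passed to a display routine, matching the storage and display options described above.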

In one embodiment of the invention, a visual display is used to display the ratio data for the classification, diagnosis and/or prediction of prognosis. The visual display can be a graphical user interface, such as a monitor, or a printer.

The invention also relates to the identification of a set of genes that permit confirmation of the presence of malignant pleural mesothelioma cells in biological samples. Probes for the expression of the genes can be incorporated into a custom array for diagnosis of malignant pleural mesothelioma. The genes identified permit, inter alia, rapid screening of cancer samples by nucleic acid microarray hybridization or protein expression technology to determine the expression of the specific genes and thereby to predict the outcome of the cancer. A microarray also can be used to diagnose malignant pleural mesothelioma and to distinguish it from lung cancer (adenocarcinoma and squamous carcinoma), normal lung tissue and/or pleura. One also can use the custom arrays (or standard arrays that contain the genes identified herein) to identify the histological subtype of MPM and the subclass of MPM for determining prognosis. Such screening is beneficial, for example, in selecting the course of treatment to provide to the cancer patient (i.e., directing therapy), and to monitor the efficacy of a treatment.

The invention differs from traditional cancer diagnostic and classification techniques with respect to the speed, simplicity, and reproducibility of the cancer diagnostic assay. The invention also differs from other microarray-based diagnostic methods in that it does not require extensive data analysis or data transformation employing complex algorithms. Further, the invention differs from other cancer diagnostic methods in that it permits accurate diagnosis and classification of tumors by the analysis of a limited set of genes. The use of a limited set of genes in the methods permits the use of simpler methods for acquisition of data, e.g., nucleic acid hybridization based methods such as RT-PCR, that do not generate massive quantities of data from parallel analysis of a large number of genes. The invention also presents targets for drug development because it identifies genes that are differentially expressed in tumors, which can be utilized in the development of drugs to treat such tumors, e.g., by reducing expression of the genes or reducing activity of proteins encoded by the genes.

The invention simplifies prognosis determination by providing an identified set of a small number of genes whose level of expression in malignant pleural mesothelioma predicts clinical outcome as defined by, e.g., patient survival times. In developing the invention, RNA expression phenotyping was performed using high density microarrays generated from quantitative expression data on over 12,000 genes, which have been analyzed to identify specific probe sets (genes). The expression gene set has multifold uses including, but not limited to, the following examples. The expression gene set may be used as a prognostic tool for malignant pleural mesothelioma patients, to make possible more finely tuned diagnosis of malignant pleural mesothelioma and allow healthcare professionals to tailor treatment to individual patients' needs. The invention can also assess the efficacy of cancer treatment by determining progression or regression of malignant pleural mesothelioma cancer in patients before, during, and after treatment. Another utility of the expression gene set is in the biotechnology and pharmaceutical industries' research on disease pathway discovery for therapeutic targeting. The invention can identify alterations in gene expression in malignant pleural mesothelioma and can also be used to uncover and test candidate pharmaceutical agents to treat malignant pleural mesothelioma.

As used herein, a subject is a human, non-human primate, cow, horse, pig, sheep, goat, dog, cat, or rodent. In all embodiments human subjects are preferred. In aspects of the invention pertaining to diagnosis of malignant pleural mesothelioma, the subject is a human either suspected of having malignant pleural mesothelioma, or having been diagnosed with malignant pleural mesothelioma. In aspects of the invention pertaining to cancer diagnosis in general, using the non-linear methods employing ratios of gene expression described herein, the subject preferably is a human suspected of having cancer, or a human having been previously diagnosed as having cancer. Methods for identifying subjects suspected of having cancer may include physical examination, subject's family medical history, subject's medical history, biopsy, or a number of imaging technologies such as ultrasonography, computed tomography, magnetic resonance imaging, magnetic resonance spectroscopy, or positron emission tomography. Diagnostic methods for cancer and the clinical delineation of cancer diagnoses are well known to those of skill in the medical arts.

As used herein, a tissue sample is tissue obtained from a tissue biopsy using methods well known to those of ordinary skill in the related medical arts. The phrase “suspected of being cancerous” as used herein means a cancer tissue sample believed by one of ordinary skill in the medical arts to contain cancerous cells. Methods for obtaining the sample from the biopsy include gross apportioning of a mass, microdissection, laser-based microdissection, or other art-known cell-separation methods.

Because of the variability of the cell types in diseased-tissue biopsy material, and the variability in sensitivity of the diagnostic methods used, the sample size required for analysis may range from 1, 10, 50, 100, 200, 300, 500, 1000, 5000, 10,000, to 50,000 or more cells. The appropriate sample size may be determined based on the cellular composition and condition of the biopsy and the standard preparative steps for this determination and subsequent isolation of the nucleic acid for use in the invention are well known to one of ordinary skill in the art. An example of this, although not intended to be limiting, is that in some instances a sample from the biopsy may be sufficient for assessment of RNA expression without amplification, but in other instances the lack of suitable cells in a small biopsy region may require use of RNA conversion and/or amplification methods or other methods to enhance resolution of the nucleic acid molecules. Such methods, which allow use of limited biopsy materials, are well known to those of ordinary skill in the art and include, but are not limited to: direct RNA amplification, reverse transcription of RNA to cDNA, amplification of cDNA, or the generation of radio-labeled nucleic acids.

As used herein, the phrase determining the expression of a set of nucleic acid molecules in the tissue means identifying RNA transcripts in the tissue sample by analysis of nucleic acid or protein expression in the tissue sample. As used herein for diagnosis of MPM and/or determination of outcome of MPM patients, “set” refers to a group of nucleic acid molecules that includes 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, or 35 different nucleic acid sequences from the group of 26 nucleic acid sequences in Table 1 (SEQ ID NOs: 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57 and 59) and/or from the group of 11 nucleic acid sequences in Table 3 (SEQ ID NOs: 43, 45, 61, 63, 65, 67, 69, 71, 73, 75 and 77). Other sets will be used for other malignancies or other disorders to determine gene ratios for diagnosis, outcome determination and the like; some of these data sets are described in the Examples below.
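
The upper limit of 35 follows from the two lists as recited: Table 1 contributes 26 SEQ ID NOs, Table 3 contributes 11, and two (SEQ ID NOs: 43 and 45) appear in both. A short check of that arithmetic is given below; the variable names are illustrative only.

```python
# SEQ ID NOs recited for Table 1: the odd numbers 9 through 59.
table_1 = list(range(9, 60, 2))

# SEQ ID NOs recited for Table 3: 43, 45, and the odd numbers 61 through 77.
table_3 = [43, 45] + list(range(61, 78, 2))

shared = sorted(set(table_1) & set(table_3))
combined = sorted(set(table_1) | set(table_3))

print(len(table_1), len(table_3), shared, len(combined))
```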

The expression of the set of nucleic acid molecules in the sample from the patient suspected of having malignant pleural mesothelioma can be compared to the expression of the set of nucleic acid molecules in a sample of tissue that is non-cancerous. As used herein with respect to diagnosis of malignant pleural mesothelioma, non-cancerous tissue means tissue determined by one of ordinary skill in the medical art to have no evidence of malignant pleural mesothelioma based on standard diagnostic methods including, but not limited to, histologic staining and microscopic analysis.

Nucleic acid markers for cancer are nucleic acid molecules that by their presence or absence indicate the presence or absence of malignant pleural mesothelioma. In tissue, certain nucleic acid molecules are expressed at different levels depending on whether tissue is non-cancerous or cancerous.

Hybridization methods for nucleic acids are well known to those of ordinary skill in the art (see, e.g., Molecular Cloning: A Laboratory Manual, J. Sambrook, et al., eds., Second Edition, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y., 1989, or Current Protocols in Molecular Biology, F. M. Ausubel, et al., eds., John Wiley & Sons, Inc., New York). The nucleic acid molecules from a malignant pleural mesothelioma tissue sample hybridize under stringent conditions to nucleic acid markers expressed in malignant pleural mesothelioma. In one embodiment the markers are sets of two or more of the nucleic acid molecules as set forth in Table 1 (SEQ ID NOs: 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57 and 59) or Table 3 (SEQ ID NOs: 43, 45, 61, 63, 65, 67, 69, 71, 73, 75, 77).

The malignant pleural mesothelioma nucleic acid markers disclosed herein are known genes and fragments thereof. It may be desirable to identify variants of those genes, such as allelic variants or single nucleotide polymorphisms (SNPs) in tissues. Accordingly, methods for identifying malignant pleural mesothelioma nucleic acid markers, including variants of the disclosed full-length cDNAs, genomic DNAs, and SNPs are also included in the invention. The methods include contacting a nucleic acid sample (such as a cDNA library, genomic library, genomic DNA isolate, etc.) with a nucleic acid probe or primer derived from one of SEQ ID NOs: 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77. The probe or primer hybridizes to complementary nucleotide sequences of nucleic acids in the sample, if any are present, allowing detection of nucleic acids related to SEQ ID NOs: 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77. Preferably the probe or primer is detectably labeled. The specific conditions, reagents, and the like can be selected by one of ordinary skill in the art to selectively identify nucleic acids related to sets of two or more of SEQ ID NOs: 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77. The isolated nucleic acid molecule can be sequenced according to standard procedures.

In addition to native nucleic acid markers (SEQ ID NOs: 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77), the invention also includes degenerate nucleic acids that include alternative codons to those present in the native materials. For example, serine residues are encoded by the codons TCA, AGT, TCC, TCG, TCT, and AGC. Each of the six codons is equivalent for the purposes of encoding a serine residue. Similarly, nucleotide sequence triplets that encode other amino acid residues include, but are not limited to: CCA, CCC, CCG, and CCT (proline codons); CGA, CGC, CGG, CGT, AGA, and AGG (arginine codons); ACA, ACC, ACG, and ACT (threonine codons); AAC and AAT (asparagine codons); and ATA, ATC, and ATT (isoleucine codons). Other amino acid residues may be encoded similarly by multiple nucleotide sequences. Thus, the invention embraces degenerate nucleic acids that differ from the biologically isolated nucleic acids in codon sequence due to the degeneracy of the genetic code.
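
The notion of degenerate nucleic acids can be illustrated with a minimal sketch that uses only the codons listed above. The table is deliberately partial, covering six amino acids, and the function names and test sequences are hypothetical; a complete implementation would use the full standard genetic code.

```python
from typing import Dict, List, Set

# Partial codon table containing only the codons recited in the text.
CODONS_BY_RESIDUE: Dict[str, Set[str]] = {
    "Ser": {"TCA", "AGT", "TCC", "TCG", "TCT", "AGC"},
    "Pro": {"CCA", "CCC", "CCG", "CCT"},
    "Arg": {"CGA", "CGC", "CGG", "CGT", "AGA", "AGG"},
    "Thr": {"ACA", "ACC", "ACG", "ACT"},
    "Asn": {"AAC", "AAT"},
    "Ile": {"ATA", "ATC", "ATT"},
}

# Reverse lookup: codon -> amino acid residue.
CODON_TO_RESIDUE: Dict[str, str] = {
    codon: residue
    for residue, codons in CODONS_BY_RESIDUE.items()
    for codon in codons
}


def translate(coding_sequence: str) -> List[str]:
    """Translate a DNA coding sequence using the partial table above."""
    sequence = coding_sequence.upper()
    if len(sequence) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    residues = []
    for start in range(0, len(sequence), 3):
        codon = sequence[start:start + 3]
        if codon not in CODON_TO_RESIDUE:
            raise KeyError(f"codon {codon} is not in the partial table")
        residues.append(CODON_TO_RESIDUE[codon])
    return residues


def encode_same_peptide(sequence_a: str, sequence_b: str) -> bool:
    """Return True if two sequences differ only by synonymous codons."""
    return translate(sequence_a) == translate(sequence_b)


# Two different nucleotide sequences that both encode Ser-Arg-Pro.
print(encode_same_peptide("TCAAGACCA", "AGCCGTCCG"))
```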

The invention also provides modified nucleic acid molecules, which include additions, substitutions, and deletions of one or more nucleotides such as the allelic variants and SNPs described above. In preferred embodiments, these modified nucleic acid molecules and/or the polypeptides they encode retain at least one activity or function of the unmodified nucleic acid molecule and/or the polypeptides, such as hybridization, antibody binding, etc. In certain embodiments, the modified nucleic acid molecules encode modified polypeptides, preferably polypeptides having conservative amino acid substitutions. As used herein, a “conservative amino acid substitution” refers to an amino acid substitution which does not alter the relative charge or size characteristics of the protein in which the amino acid substitution is made. Conservative substitutions of amino acids include substitutions made amongst amino acids within the following groups: (a) M, I, L, V; (b) F, Y, W; (c) K, R, H; (d) A, G; (e) S, T; (f) Q, N; and (g) E, D. The modified nucleic acid molecules are structurally related to the unmodified nucleic acid molecules and in preferred embodiments are sufficiently structurally related to the unmodified nucleic acid molecules so that the modified and unmodified nucleic acid molecules hybridize under stringent conditions known to one of skill in the art.
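
The groups (a) through (g) lend themselves to a simple lookup. The sketch below is illustrative only; the function name is hypothetical, and treating an unchanged residue as "not a substitution" is an assumption made here rather than something stated in the text.

```python
# Conservative substitution groups (a) through (g), in one-letter amino acid code.
CONSERVATIVE_GROUPS = ("MILV", "FYW", "KRH", "AG", "ST", "QN", "ED")


def is_conservative_substitution(original: str, replacement: str) -> bool:
    """Return True if replacing `original` with `replacement` stays within one group."""
    original = original.upper()
    replacement = replacement.upper()
    if original == replacement:
        # Assumption: an unchanged residue is not counted as a substitution.
        return False
    return any(
        original in group and replacement in group
        for group in CONSERVATIVE_GROUPS
    )


print(is_conservative_substitution("K", "R"))  # both in group (c)
print(is_conservative_substitution("K", "E"))  # groups (c) and (g)
```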

For example, modified nucleic acid molecules that encode polypeptides having single amino acid changes can be prepared for use in the methods and products disclosed herein. Each of these nucleic acid molecules can have one, two, or three nucleotide substitutions exclusive of nucleotide changes corresponding to the degeneracy of the genetic code as described herein. Likewise, modified nucleic acid molecules that encode polypeptides having two amino acid changes can be prepared, which have, e.g., 2-6 nucleotide changes. Numerous modified nucleic acid molecules like these will be readily envisioned by one of skill in the art, including for example, substitutions of nucleotides in codons encoding amino acids 2 and 3, 2 and 4, 2 and 5, 2 and 6, and so on. In the foregoing example, each combination of two amino acids is included in the set of modified nucleic acid molecules, as well as all nucleotide substitutions that code for the amino acid substitutions. Additional nucleic acid molecules that encode polypeptides having additional substitutions (i.e., 3 or more), additions or deletions [e.g., by introduction of a stop codon or a splice site(s)] also can be prepared and are embraced by the invention as readily envisioned by one of ordinary skill in the art. Any of the foregoing nucleic acids can be tested by routine experimentation for retention of structural relation to or activity similar to the nucleic acids disclosed herein.

In the invention, standard hybridization techniques of microarray technology are utilized to assess patterns of nucleic acid expression and identify nucleic acid marker expression. Microarray technology, which is also known by other names including DNA chip technology, gene chip technology, and solid-phase nucleic acid array technology, is well known to those of ordinary skill in the art and is based on, but not limited to, obtaining an array of identified nucleic acid probes on a fixed substrate, labeling target molecules with reporter molecules (e.g., radioactive, chemiluminescent, or fluorescent tags such as fluorescein, Cy3-dUTP, or Cy5-dUTP), hybridizing target nucleic acids to the probes, and evaluating target-probe hybridization. A probe with a nucleic acid sequence that perfectly matches the target sequence will, in general, result in detection of a stronger reporter-molecule signal than will probes with less perfect matches. Many components and techniques utilized in nucleic acid microarray technology are presented in The Chipping Forecast, Nature Genetics, Vol. 21, January 1999, the entire contents of which are incorporated by reference herein.

According to the present invention, microarray substrates may include but are not limited to glass, silica, aluminosilicates, borosilicates, metal oxides such as alumina and nickel oxide, various clays, nitrocellulose, or nylon. In all embodiments a glass substrate is preferred. According to the invention, probes are selected from the group of nucleic acids including, but not limited to: DNA, genomic DNA, cDNA, and oligonucleotides; and may be natural or synthetic. Oligonucleotide probes preferably are 20 to 25-mer oligonucleotides and DNA/cDNA probes preferably are 500 to 5000 bases in length, although other lengths may be used. Appropriate probe length may be determined by one of ordinary skill in the art by following art-known procedures. In one embodiment, preferred probes are sets of two or more of the nucleic acid molecules set forth as SEQ ID NOs: 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77 (see also Table 1 and Table 3). Probes may be purified to remove contaminants using standard methods known to those of ordinary skill in the art such as gel filtration or precipitation.

In one embodiment, the microarray substrate may be coated with a compound to enhance synthesis of the probe on the substrate. Such compounds include, but are not limited to, oligoethylene glycols. In another embodiment, coupling agents or groups on the substrate can be used to covalently link the first nucleotide or oligonucleotide to the substrate. These agents or groups may include, but are not limited to: amino, hydroxy, bromo, and carboxy groups. These reactive groups are preferably attached to the substrate through a hydrocarbyl radical such as an alkylene or phenylene divalent radical, one valence position occupied by the chain bonding and the remaining attached to the reactive groups. These hydrocarbyl groups may contain up to about ten carbon atoms, preferably up to about six carbon atoms. Alkylene radicals are usually preferred containing two to four carbon atoms in the principal chain. These and additional details of the process are disclosed, for example, in U.S. Pat. No. 4,458,066, which is incorporated by reference in its entirety.

In one embodiment, probes are synthesized directly on the substrate in a predetermined grid pattern using methods such as light-directed chemical synthesis, photochemical deprotection, or delivery of nucleotide precursors to the substrate and subsequent probe production.

In another embodiment, the substrate may be coated with a compound to enhance binding of the probe to the substrate. Such compounds include, but are not limited to: polylysine, amino silanes, amino-reactive silanes (Chipping Forecast, 1999) or chromium (Gwynne and Page, 2000). In this embodiment, presynthesized probes are applied to the substrate in a precise, predetermined volume and grid pattern, utilizing a computer-controlled robot to apply probe to the substrate in a contact-printing manner or in a non-contact manner such as ink jet or piezo-electric delivery. Probes may be covalently linked to the substrate with methods that include, but are not limited to, UV-irradiation. In another embodiment probes are linked to the substrate with heat.

Targets are nucleic acids selected from the group including, but not limited to: DNA, genomic DNA, cDNA, RNA, and mRNA; and may be natural or synthetic. In all embodiments, nucleic acid molecules from human tissue are preferred. The tissue may be obtained from a subject or may be grown in culture (e.g., from a malignant pleural mesothelioma cell line).

In embodiments of the invention one or more control nucleic acid molecules are attached to the substrate. Preferably, control nucleic acid molecules allow determination of factors including but not limited to: nucleic acid quality and binding characteristics; reagent quality and effectiveness; hybridization success; and analysis thresholds and success. Control nucleic acids may include but are not limited to expression products of genes such as housekeeping genes or fragments thereof.

In one embodiment of the invention, expression of nucleic acid markers is used to select clinical treatment paradigms for cancers, such as malignant pleural mesothelioma. Treatment options, as described herein, may include but are not limited to: radiotherapy, chemotherapy, adjuvant therapy, or any combination of the aforementioned methods. Aspects of treatment that may vary include, but are not limited to: dosages, timing of administration, or duration of therapy; and may or may not be combined with other treatments, which may also vary in dosage, timing, or duration. Another treatment for malignant pleural mesothelioma is surgery, which can be utilized either alone or in combination with any of the aforementioned treatment methods. One of ordinary skill in the medical arts may determine an appropriate treatment paradigm based on evaluation of differential expression of sets of two or more genes, such as those set forth as SEQ ID NOs:9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77 for malignant pleural mesothelioma. Cancers that express markers that are indicative of a more aggressive cancer or poor prognosis may be treated with more aggressive therapies.

Progression or regression of malignant pleural mesothelioma is determined by comparison of two or more different malignant pleural mesothelioma tissue samples taken at two or more different times from a subject. For example, progression or regression may be evaluated by assessments of expression of sets of two or more of the nucleic acid targets, preferably using ratios of expression, including but not limited to SEQ ID NOs:9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, in a malignant pleural mesothelioma tissue sample from a subject before, during, and following treatment for malignant pleural mesothelioma. Progression or regression of other cancers or disease states would be determined similarly.
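
By way of illustration only, the comparison of expression ratios across samples taken at different times can be sketched as follows. The gene labels, the expression values, and the function names `expression_ratio` and `ratio_trend` are hypothetical and do not come from the disclosure; the sketch simply computes the ratio of two genes' expression in each sample and reports its fold change relative to the first sample.

```python
from typing import Dict, List


def expression_ratio(sample: Dict[str, float], numerator: str, denominator: str) -> float:
    """Ratio of the expression levels of two genes within one sample."""
    if sample[denominator] <= 0:
        raise ValueError("denominator expression must be positive")
    return sample[numerator] / sample[denominator]


def ratio_trend(samples: List[Dict[str, float]], numerator: str, denominator: str) -> List[float]:
    """Fold change of the ratio at each time point relative to the first sample."""
    ratios = [expression_ratio(s, numerator, denominator) for s in samples]
    baseline = ratios[0]
    if baseline == 0:
        raise ValueError("baseline ratio must be non-zero")
    return [r / baseline for r in ratios]


# Hypothetical expression levels (arbitrary units) for two marker genes.
time_course = [
    {"gene_a": 800.0, "gene_b": 100.0},  # before treatment: ratio 8.0
    {"gene_a": 400.0, "gene_b": 100.0},  # during treatment: ratio 4.0
    {"gene_a": 200.0, "gene_b": 100.0},  # after treatment: ratio 2.0
]
trend = ratio_trend(time_course, "gene_a", "gene_b")
print(trend)
```

Whether a rising or falling ratio corresponds to progression or regression depends on the particular gene pair and is not decided by this sketch.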

In another embodiment, novel pharmacological agents useful in the treatment of malignant pleural mesothelioma can be identified by assessing variations in the expression of sets of two or more malignant pleural mesothelioma nucleic acid markers (preferably, variations in the ratios of expression), from among SEQ ID NOs:9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, prior to and after contacting malignant pleural mesothelioma cells or tissues with candidate pharmacological agents for the treatment of malignant pleural mesothelioma. The cells may be grown in culture (e.g., from a malignant pleural mesothelioma cell line), or may be obtained from a subject (e.g., in a clinical trial of candidate pharmaceutical agents to treat malignant pleural mesothelioma). Alterations in expression of two or more sets of nucleic acid markers, from among SEQ ID NOs:9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, in malignant pleural mesothelioma cells or tissues tested before and after contact with a candidate pharmacological agent to treat malignant pleural mesothelioma, indicate progression, regression, or stasis of the malignant pleural mesothelioma, thereby indicating efficacy of candidate agents and concomitant identification of lead compounds for therapeutic use in malignant pleural mesothelioma.

The invention further provides efficient methods of identifying pharmacological agents or lead compounds for agents active at the level of malignant pleural mesothelioma cellular function. Generally, the screening methods involve assaying for compounds that beneficially alter malignant pleural mesothelioma nucleic acid molecule expression. Such methods are adaptable to automated, high-throughput screening of compounds.

The assay mixture comprises a candidate pharmacological agent. Typically, a plurality of assay mixtures are run in parallel with different agent concentrations to obtain a different response to the various concentrations. Typically, one of these concentrations serves as a negative control, i.e., at zero concentration of agent or at a concentration of agent below the limits of assay detection. Candidate agents encompass numerous chemical classes, although typically they are organic compounds. Preferably, the candidate pharmacological agents are small organic compounds, i.e., those having a molecular weight of more than 50 yet less than about 2500, preferably less than about 1000 and, more preferably, less than about 500. Candidate agents comprise functional chemical groups necessary for structural interactions with polypeptides and/or nucleic acids, and typically include at least an amine, carbonyl, hydroxyl, or carboxyl group, preferably at least two of the functional chemical groups and more preferably at least three of the functional chemical groups. The candidate agents can comprise cyclic carbon or heterocyclic structure and/or aromatic or polyaromatic structures substituted with one or more of the above-identified functional groups. Candidate agents also can be biomolecules such as peptides, saccharides, fatty acids, sterols, isoprenoids, purines, pyrimidines, derivatives or structural analogs of the above, or combinations thereof and the like. Where the agent is a nucleic acid, the agent typically is a DNA or RNA molecule, although modified nucleic acids as defined herein are also contemplated.
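
As a non-limiting sketch, the molecular weight ranges and functional group counts recited above could be applied to a compound library as shown below. The function names, the tier labels, and the example compound are hypothetical; in particular, treating the "about" limits as hard cutoffs is an assumption made only to keep the example simple.

```python
from typing import Iterable, Optional

# Functional groups named in the preceding paragraph.
NAMED_GROUPS = {"amine", "carbonyl", "hydroxyl", "carboxyl"}


def weight_tier(molecular_weight: float) -> Optional[str]:
    """Narrowest recited weight range containing the value, or None if outside."""
    if molecular_weight <= 50 or molecular_weight >= 2500:
        return None
    if molecular_weight < 500:
        return "more preferred"
    if molecular_weight < 1000:
        return "preferred"
    return "acceptable"


def named_group_count(groups: Iterable[str]) -> int:
    """Number of distinct named functional groups present in a candidate."""
    return len(NAMED_GROUPS & set(groups))


# Hypothetical candidate: weight 320, bearing an amine, a hydroxyl and a thiol.
candidate_groups = ["amine", "hydroxyl", "thiol"]
print(weight_tier(320.0), named_group_count(candidate_groups))
```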

Candidate agents are obtained from a wide variety of sources including libraries of synthetic or natural compounds. For example, numerous means are available for random and directed synthesis of a wide variety of organic compounds and biomolecules, including expression of randomized oligonucleotides, synthetic organic combinatorial libraries, phage display libraries of random peptides, and the like. Alternatively, libraries of natural compounds in the form of bacterial, fungal, plant, and animal extracts are available or readily produced. Additionally, natural and synthetically produced libraries and compounds can be readily modified through conventional chemical, physical, and biochemical means. Further, known pharmacological agents may be subjected to directed or random chemical modifications such as acylation, alkylation, esterification, amidification, etc. to produce structural analogs of the agents.

A variety of other reagents also can be included in the mixture. These include reagents such as salts, buffers, neutral proteins (e.g., albumin), detergents, etc. which may be used to facilitate optimal protein-protein and/or protein-nucleic acid binding. Such a reagent may also reduce non-specific or background interactions of the reaction components. Other reagents that improve the efficiency of the assay such as protease inhibitors, nuclease inhibitors, antimicrobial agents, and the like may also be used.

The mixture of the foregoing assay materials is incubated under conditions whereby the anti-cancer candidate agent specifically binds the cellular binding target, a portion thereof or analog thereof. The order of addition of components, incubation temperature, time of incubation, and other parameters of the assay may be readily determined. Such experimentation merely involves optimization of the assay parameters, not the fundamental composition of the assay. Incubation temperatures typically are between 4° C. and 40° C. Incubation times preferably are minimized to facilitate rapid, high throughput screening, and typically are between 0.1 and 10 hours.

After incubation, the presence or absence of specific binding between the anti-malignant pleural mesothelioma candidate agent and one or more binding targets is detected by any convenient method available to the user. For cell-free binding type assays, a separation step is often used to separate bound from unbound components. The separation step may be accomplished in a variety of ways. Conveniently, at least one of the components is immobilized on a solid substrate, from which the unbound components may be easily separated. The solid substrate can be made of a wide variety of materials and in a wide variety of shapes, e.g., microtiter plate, microbead, dipstick, resin particle, etc. The substrate preferably is chosen to maximize signal-to-noise ratios, primarily to minimize background binding, as well as for ease of separation and cost.

Separation may be effected, for example, by removing a bead or dipstick from a reservoir, emptying or diluting a reservoir such as a microtiter plate well, or rinsing a bead, particle, chromatographic column or filter with a wash solution or solvent. The separation step preferably includes multiple rinses or washes. For example, when the solid substrate is a microtiter plate, the wells may be washed several times with a washing solution, which typically includes those components of the incubation mixture that do not participate in specific binding, such as salts, buffer, detergent, non-specific protein, etc. Where the solid substrate is a magnetic bead, the beads may be washed one or more times with a washing solution and isolated using a magnet.

Detection may be effected in any convenient way for cell-based assays such as two- or three-hybrid screens. The transcript resulting from a reporter gene transcription assay of the anti-cancer agent binding to a target molecule typically encodes a directly or indirectly detectable product, e.g., β-galactosidase activity, luciferase activity, and the like. For cell-free binding assays, one of the components usually comprises, or is coupled to, a detectable label. A wide variety of labels can be used, such as those that provide direct detection (e.g., radioactivity, luminescence, optical or electron density, etc.) or indirect detection (e.g., epitope tag such as the FLAG epitope, enzyme tag such as horseradish peroxidase, etc.). The label may be bound to an anti-cancer agent binding partner, or incorporated into the structure of the binding partner.

A variety of methods may be used to detect the label, depending on the nature of the label and other assay components. For example, the label may be detected while bound to the solid substrate or subsequent to separation from the solid substrate. Labels may be directly detected through optical or electron density, radioactive emissions, nonradiative energy transfers, etc. or indirectly detected with antibody conjugates, streptavidin-biotin conjugates, etc. Methods for detecting the labels are well known in the art.

The invention thus generally provides cancer gene- or protein-specific binding agents, methods of identifying and making such agents, and their use in diagnosis, therapy and pharmaceutical development. For example, malignant pleural mesothelioma gene- or protein-specific pharmacological agents are useful in a variety of diagnostic and therapeutic applications as described herein. In general, the specificity of a cancer gene or protein binding to a binding agent is shown by binding equilibrium constants. Targets that are capable of selectively binding a cancer gene preferably have binding equilibrium constants of at least about 10⁷ M⁻¹, more preferably at least about 10⁸ M⁻¹, and most preferably at least about 10⁹ M⁻¹. A wide variety of cell-based and cell-free assays may be used to demonstrate cancer gene-specific binding. Cell-based assays include one, two and three hybrid screens, assays in which cancer gene-mediated transcription is inhibited or increased, etc. Cell-free assays include cancer gene-protein binding assays, immunoassays, etc. Other assays useful for screening agents which bind cancer polypeptides include fluorescence resonance energy transfer (FRET), and electrophoretic mobility shift analysis (EMSA).
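
For illustration, the binding equilibrium constant thresholds recited above can be expressed as a simple classification. Because the constants carry units of M⁻¹, they are read here as association constants, so the corresponding dissociation constant is the reciprocal. The function names and tier labels are hypothetical, and the "at least about" limits are treated as exact cutoffs only for the purpose of the sketch.

```python
import math


def binding_tier(ka_per_molar: float) -> str:
    """Classify an association constant (in M^-1) against the recited thresholds."""
    if ka_per_molar >= 1e9:
        return "most preferred"
    if ka_per_molar >= 1e8:
        return "more preferred"
    if ka_per_molar >= 1e7:
        return "preferred"
    return "below recited range"


def dissociation_constant(ka_per_molar: float) -> float:
    """Dissociation constant in M, the reciprocal of the association constant."""
    if ka_per_molar <= 0:
        raise ValueError("association constant must be positive")
    return 1.0 / ka_per_molar


# Hypothetical measured association constant of 5 x 10^8 M^-1.
ka = 5e8
print(binding_tier(ka), dissociation_constant(ka))
```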

In another aspect of the invention, pre- and post-treatment alterations in expression of two or more sets of cancer nucleic acid markers, for example malignant pleural mesothelioma cancer nucleic acid markers including, but not limited to, SEQ ID NOs:9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, in cancer cells or tissues may be used to assess treatment parameters including, but not limited to: dosage, method of administration, timing of administration, and combination with other treatments as described herein.

Candidate pharmacological agents may include antisense oligonucleotides that selectively bind to a cancer-related nucleic acid marker molecule, as identified herein, to reduce the expression of the marker molecules in cancer cells and tissues. One of ordinary skill in the art can test the effects of a reduction of expression of cancer nucleic acid marker sequences in vivo or in vitro, to determine the efficacy of one or more antisense oligonucleotides.

As used herein, the term “antisense oligonucleotide” or “antisense” describes an oligonucleotide that is an oligoribonucleotide, oligodeoxyribonucleotide, modified oligoribonucleotide, or modified oligodeoxyribonucleotide, which hybridizes under physiological conditions to DNA comprising a particular gene or to an mRNA transcript of that gene and, thereby, inhibits the transcription of that gene and/or the translation of that mRNA. The antisense molecules are designed so as to interfere with transcription or translation of a target gene upon hybridization with the target gene or transcript. Those skilled in the art will recognize that the exact length of the antisense oligonucleotide and its degree of complementarity with its target will depend upon the specific target selected, including the sequence of the target and the particular bases which comprise that sequence. It is preferred that the antisense oligonucleotide be constructed and arranged so as to bind selectively with the target under physiological conditions, i.e., to hybridize substantially more to the target sequence than to any other sequence in the target cell under physiological conditions.

Based upon the sequences of cancer expressed nucleic acids, or upon allelic or homologous genomic and/or cDNA sequences, one of skill in the art can easily choose and synthesize any of a number of appropriate antisense molecules for use in accordance with the present invention. In order to be sufficiently selective and potent for inhibition, such antisense oligonucleotides should comprise at least 10 and, more preferably, at least 15 consecutive bases that are complementary to the target, although in certain cases modified oligonucleotides as short as 7 bases in length have been used successfully as antisense oligonucleotides (Wagner et al., 1996). Most preferably, the antisense oligonucleotides comprise a complementary sequence of 20-30 bases. Although oligonucleotides may be chosen that are antisense to any region of the gene or mRNA transcripts, in preferred embodiments the antisense oligonucleotides correspond to N-terminal or 5′ upstream sites such as translation initiation, transcription initiation, or promoter sites. In addition, 3′-untranslated regions may be targeted. Targeting to mRNA splicing sites has also been used in the art but may be less preferred if alternative mRNA splicing occurs. In addition, the antisense is targeted, preferably, to sites in which mRNA secondary structure is not expected (see, e.g., Sainio et al., 1994) and at which proteins are not expected to bind. Finally, although the listed sequences are cDNA sequences, one of ordinary skill in the art may easily derive the genomic DNA corresponding to the cDNA of a cancer expressed polypeptide. Thus, the present invention also provides for antisense oligonucleotides that are complementary to the genomic DNA corresponding to cancer expressed nucleic acids, e.g., the malignant pleural mesothelioma nucleic acid markers described herein. Similarly, the use of antisense to allelic or homologous cDNAs and genomic DNAs is enabled without undue experimentation.
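
As a minimal sketch of the sequence selection described above, an unmodified antisense oligodeoxynucleotide can be derived as the reverse complement of a chosen window of a cDNA (sense strand) sequence. The function name `antisense_oligo`, the default window length, and the example sequence are hypothetical; the sketch enforces only the recited minimum of 10 complementary bases and does not model secondary structure, protein binding sites, or chemical modifications.

```python
# Complement table for unmodified DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def antisense_oligo(cdna: str, start: int, length: int = 20) -> str:
    """Reverse complement of cdna[start:start + length], written 5' to 3'."""
    if length < 10:
        raise ValueError("at least 10 complementary bases are required")
    window = cdna.upper()[start:start + length]
    if len(window) != length:
        raise ValueError("window extends past the end of the sequence")
    if set(window) - set("ACGT"):
        raise ValueError("sequence contains characters other than A, C, G, T")
    return window.translate(_COMPLEMENT)[::-1]


# Hypothetical cDNA fragment beginning at a translation initiation codon.
example_cdna = "ATGGCCAAGCTGACCTGAAGGTCA"
print(antisense_oligo(example_cdna, 0, 10))
```

A window of 20 to 30 bases, the range stated as most preferred, is obtained by passing the corresponding `length`.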

In one set of embodiments, the antisense oligonucleotides of the invention may be composed of “natural” deoxyribonucleotides, ribonucleotides, or any combination thereof. That is, the 5′ end of one native nucleotide and the 3′ end of another native nucleotide may be covalently linked, as in natural systems, via a phosphodiester internucleoside linkage. These oligonucleotides may be prepared by art-recognized methods, which may be carried out manually or by an automated synthesizer. They also may be produced recombinantly by vectors.

In preferred embodiments, however, the antisense oligonucleotides of the invention also may include “modified” oligonucleotides. That is, the oligonucleotides may be modified in a number of ways that do not prevent them from hybridizing to their target but which enhance their stability or targeting or which otherwise enhance their therapeutic effectiveness. The term “modified oligonucleotide” as used herein describes an oligonucleotide in which (1) at least two of its nucleotides are covalently linked via a synthetic internucleoside linkage (i.e., a linkage other than a phosphodiester linkage between the 5′ end of one nucleotide and the 3′ end of another nucleotide) and/or (2) a chemical group not normally associated with nucleic acids has been covalently attached to the oligonucleotide. Preferred synthetic internucleoside linkages are phosphorothioates, alkylphosphonates, phosphorodithioates, phosphate esters, alkylphosphonothioates, phosphoramidates, carbamates, carbonates, phosphate triesters, acetamidates, carboxymethyl esters, and peptides.

The term “modified oligonucleotide” also encompasses oligonucleotides with a covalently modified base and/or sugar. For example, modified oligonucleotides include oligonucleotides having backbone sugars that are covalently attached to low molecular weight organic groups other than a hydroxyl group at the 3′ position and other than a phosphate group at the 5′ position. Thus modified oligonucleotides may include a 2′-O-alkylated ribose group. In addition, modified oligonucleotides may include sugars such as arabinose instead of ribose. The present invention, thus, contemplates pharmaceutical preparations containing modified antisense molecules that are complementary to and hybridizable with, under physiological conditions, malignant pleural mesothelioma expressed nucleic acids, together with pharmaceutically acceptable carriers.

Antisense oligonucleotides may be administered as part of a pharmaceutical composition. Such a pharmaceutical composition may include the antisense oligonucleotides in combination with any standard physiologically and/or pharmaceutically acceptable carriers which are known in the art. The compositions should be sterile and contain a therapeutically effective amount of the antisense oligonucleotides in a unit of weight or volume suitable for administration to a patient. The term “pharmaceutically acceptable” means a non-toxic material that does not interfere with the effectiveness of the biological activity of the active ingredients. The term “physiologically acceptable” refers to a non-toxic material that is compatible with a biological system such as a cell, cell culture, tissue, or organism. The characteristics of the carrier will depend on the route of administration. Physiologically and pharmaceutically acceptable carriers include diluents, fillers, salts, buffers, stabilizers, solubilizers, and other materials, which are well known in the art.

Expression of cancer nucleic acid molecules can also be determined using protein measurement methods, e.g., for use in the ratio-based diagnostic and prognostic methods described herein. For example, the expression of malignant pleural mesothelioma genes such as SEQ ID NOs:9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, can be determined by examining the expression of polypeptides encoded by SEQ ID NOs:9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77 (exemplary translations are provided herein as SEQ ID NOs:10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78). Preferred methods of specifically and quantitatively measuring proteins include, but are not limited to: mass spectroscopy-based methods such as surface enhanced laser desorption ionization (SELDI; e.g., Ciphergen ProteinChip System), non-mass spectroscopy-based methods, immunoassay methods such as ELISA and immunohistochemistry-based methods such as 2-dimensional gel electrophoresis.

SELDI methodology may, through procedures known to those of ordinary skill in the art, be used to vaporize microscopic amounts of tumor protein and to create a “fingerprint” of individual proteins, thereby allowing simultaneous measurement of the abundance of many proteins in a single sample. Preferably SELDI-based assays may be utilized to classify tumors. Such assays preferably include, but are not limited to, the following examples. Gene products discovered by RNA microarrays may be selectively measured by specific (antibody mediated) capture to the SELDI protein disc (e.g., selective SELDI). Gene products discovered by protein screening (e.g., with 2-D gels), may be resolved by “total protein SELDI” optimized to visualize those particular markers of interest from among polypeptides encoded by SEQ ID NOs:9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77 (e.g., SEQ ID NOs:10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78). Predictive models of tumor classification from SELDI measurement of multiple markers from among polypeptides encoded by SEQ ID NOs:9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77 (e.g., SEQ ID NOs:10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78) may be utilized for the SELDI strategies.

The invention also involves agents such as polypeptides that bind to malignant pleural mesothelioma-associated polypeptides, e.g., SEQ ID NOs:10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78. Such binding agents can be used, for example, in screening assays to detect the presence or absence of malignant pleural mesothelioma-associated polypeptides and complexes of malignant pleural mesothelioma-associated polypeptides and their binding partners and in purification protocols to isolate malignant pleural mesothelioma-associated polypeptides and complexes of malignant pleural mesothelioma-associated polypeptides and their binding partners. Such agents also may be used to inhibit the native activity of the malignant pleural mesothelioma-associated polypeptides, for example, by binding to such polypeptides.

The invention, therefore, embraces peptide binding agents which, for example, can be antibodies or fragments of antibodies having the ability to selectively bind to malignant pleural mesothelioma-associated polypeptides. Antibodies include polyclonal and monoclonal antibodies, prepared according to conventional methodology.

Significantly, as is well-known in the art, only a small portion of an antibody molecule, the paratope, is involved in the binding of the antibody to its epitope (see, in general, Clark, W. R. (1986) The Experimental Foundations of Modern Immunology, Wiley & Sons, Inc., New York; Roitt, I. (1991) Essential Immunology, 7th Ed., Blackwell Scientific Publications, Oxford). The pFc′ and Fc regions, for example, are effectors of the complement cascade but are not involved in antigen binding. An antibody from which the pFc′ region has been enzymatically cleaved, or which has been produced without the pFc′ region, designated an F(ab′)₂ fragment, retains both of the antigen binding sites of an intact antibody. Similarly, an antibody from which the Fc region has been enzymatically cleaved, or which has been produced without the Fc region, designated an Fab fragment, retains one of the antigen binding sites of an intact antibody molecule. Proceeding further, Fab fragments consist of a covalently bound antibody light chain and a portion of the antibody heavy chain denoted Fd. The Fd fragments are the major determinant of antibody specificity (a single Fd fragment may be associated with up to ten different light chains without altering antibody specificity) and Fd fragments retain epitope-binding ability in isolation.

Within the antigen-binding portion of an antibody, as is well-known in the art, there are complementarity determining regions (CDRs), which directly interact with the epitope of the antigen, and framework regions (FRs), which maintain the tertiary structure of the paratope (see, in general, Clark, 1986; Roitt, 1991). In both the heavy chain Fd fragment and the light chain of IgG immunoglobulins, there are four framework regions (FR1 through FR4) separated respectively by three complementarity determining regions (CDR1 through CDR3). The CDRs, and in particular the CDR3 regions, and more particularly the heavy chain CDR3, are largely responsible for antibody specificity.

It is now well-established in the art that the non-CDR regions of a mammalian antibody may be replaced with similar regions of conspecific or heterospecific antibodies while retaining the epitopic specificity of the original antibody. This is most clearly manifested in the development and use of “humanized” antibodies in which non-human CDRs are covalently joined to human FR and/or Fc/pFc′ regions to produce a functional antibody. See, e.g., U.S. Pat. Nos. 4,816,567, 5,225,539, 5,585,089, 5,693,762 and 5,859,205.

Fully human monoclonal antibodies also can be prepared by immunizing mice transgenic for large portions of human immunoglobulin heavy and light chain loci. Following immunization of these mice (e.g., XenoMouse (Abgenix), HuMAb mice (Medarex/GenPharm)), monoclonal antibodies can be prepared according to standard hybridoma technology. These monoclonal antibodies will have human immunoglobulin amino acid sequences and therefore will not provoke human anti-mouse antibody (HAMA) responses when administered to humans.

Thus, as will be apparent to one of ordinary skill in the art, the present invention also provides for F(ab′)₂, Fab, Fv and Fd fragments; chimeric antibodies in which the Fc and/or FR and/or CDR1 and/or CDR2 and/or light chain CDR3 regions have been replaced by homologous human or non-human sequences; chimeric F(ab′)₂ fragment antibodies in which the FR and/or CDR1 and/or CDR2 and/or light chain CDR3 regions have been replaced by homologous human or non-human sequences; chimeric Fab fragment antibodies in which the FR and/or CDR1 and/or CDR2 and/or light chain CDR3 regions have been replaced by homologous human or non-human sequences; and chimeric Fd fragment antibodies in which the FR and/or CDR1 and/or CDR2 regions have been replaced by homologous human or non-human sequences. The present invention also includes so-called single chain antibodies.

Thus, the invention involves the use of polypeptides of numerous size and type that bind specifically to polypeptides selected from those encoded by SEQ ID NOs:9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77 (e.g., SEQ ID NOs:10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78), and complexes of both malignant pleural mesothelioma-associated polypeptides and their binding partners. These polypeptides may be derived also from sources other than antibody technology. For example, such polypeptide binding agents can be provided by degenerate peptide libraries which can be readily prepared in solution, in immobilized form or as phage display libraries. Combinatorial libraries also can be synthesized of peptides containing one or more amino acids. Libraries further can be synthesized of peptoids and non-peptide synthetic moieties.

Phage display can be particularly effective in identifying binding peptides useful according to the invention. Briefly, one prepares a phage library (using, e.g., m13, fd, or lambda phage), displaying inserts from 4 to about 80 amino acid residues using conventional procedures. The inserts may represent, for example, a completely degenerate or biased array. One then can select phage-bearing inserts which bind to the malignant pleural mesothelioma-associated polypeptide. This process can be repeated through several cycles of reselection of phage that bind to the malignant pleural mesothelioma-associated polypeptide. Repeated rounds lead to enrichment of phage bearing particular sequences. DNA sequence analysis can be conducted to identify the sequences of the expressed polypeptides. The minimal linear portion of the sequence that binds to the malignant pleural mesothelioma-associated polypeptide can be determined. One can repeat the procedure using a biased library containing inserts containing part or all of the minimal linear portion plus one or more additional degenerate residues upstream or downstream thereof. Yeast two-hybrid screening methods also may be used to identify polypeptides that bind to the malignant pleural mesothelioma-associated polypeptides.

Thus, the malignant pleural mesothelioma-associated polypeptides of the invention, including fragments thereof, can be used to screen peptide libraries, including phage display libraries, to identify and select peptide binding partners of the malignant pleural mesothelioma-associated polypeptides of the invention. Such molecules can be used, as described, for screening assays, for purification protocols, for interfering directly with the functioning of malignant pleural mesothelioma-associated polypeptides and for other purposes that will be apparent to those of ordinary skill in the art. For example, isolated malignant pleural mesothelioma-associated polypeptides can be attached to a substrate (e.g., chromatographic media, such as polystyrene beads, a filter, or an array substrate), and then a solution suspected of containing the binding partner may be applied to the substrate. If a binding partner that can interact with malignant pleural mesothelioma-associated polypeptides is present in the solution, then it will bind to the substrate-bound malignant pleural mesothelioma-associated polypeptide. The binding partner then may be isolated.

As detailed herein, the foregoing antibodies and other binding molecules may be used, for example, to identify tissues expressing protein or to purify protein. Antibodies also may be coupled to specific diagnostic labeling agents for imaging of cells and tissues that express malignant pleural mesothelioma-associated polypeptides or to therapeutically useful agents according to standard coupling procedures. Diagnostic agents include, but are not limited to, barium sulfate, iocetamic acid, iopanoic acid, ipodate calcium, diatrizoate sodium, diatrizoate meglumine, metrizamide, tyropanoate sodium and radiodiagnostics including positron emitters such as fluorine-18 and carbon-11, gamma emitters such as iodine-123, technetium-99m, iodine-131 and indium-111, and nuclides for nuclear magnetic resonance such as fluorine and gadolinium.

The invention further includes protein microarrays for analyzing expression of malignant pleural mesothelioma-associated peptides selected from those encoded by SEQ ID NOs:9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77 (e.g., SEQ ID NOs:10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78). In this aspect of the invention, standard techniques of microarray technology are utilized to assess expression of the malignant pleural mesothelioma-associated polypeptides and/or identify biological constituents that bind such polypeptides. The constituents of biological samples include antibodies, lymphocytes (particularly T lymphocytes), and the like. Protein microarray technology, which is also known by other names including protein chip technology and solid-phase protein array technology, is well known to those of ordinary skill in the art and is based on, but not limited to, obtaining an array of identified peptides or proteins on a fixed substrate, binding target molecules or biological constituents to the peptides, and evaluating such binding. See, e.g., G. MacBeath and S. L. Schreiber, “Printing Proteins as Microarrays for High-Throughput Function Determination,” Science 289(5485):1760-1763, 2000.

Preferably, antibodies or antigen binding fragments thereof that specifically bind polypeptides selected from the group consisting of those encoded by SEQ ID NOs:9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77 (e.g., SEQ ID NOs:10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78) are attached to the microarray substrate in accordance with standard attachment methods known in the art. These arrays can be used to quantify the expression of the polypeptides identified herein.

In some embodiments of the invention, one or more control peptide or protein molecules are attached to the substrate. Preferably, control peptide or protein molecules allow determination of factors such as peptide or protein quality and binding characteristics, reagent quality and effectiveness, hybridization success, and analysis thresholds and success.

The use of such methods to determine expression of malignant pleural mesothelioma nucleic acids from among SEQ ID NOs:9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77 and/or proteins encoded by SEQ ID NOs:9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77 (e.g., SEQ ID NOs:10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42, 44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78) can be done with routine methods known to those of ordinary skill in the art, and the expression determined by protein measurement methods may be used as a prognostic method for selecting treatment strategies for malignant pleural mesothelioma patients.

EXAMPLES

Example 1

Diagnosis of Thoracic Malignancies Using Gene Expression Ratios

Introduction

Malignant pleural mesothelioma (MPM) is a highly lethal pleural malignancy (1). Patients with MPM frequently present with a malignant unilateral pleural effusion or pleural thickening. However, adenocarcinoma (ADCA) metastatic to the pleura of lung or other origin is a far more common etiology for patients presenting with a unilateral pleural effusion (1). The ultimate treatment strategies depend on the correct pathological diagnosis. Early MPM is best treated with extrapleural pneumonectomy followed by chemoradiation, whereas metastatic lung cancer is treated with chemotherapy alone (2). Not infrequently, distinguishing MPM from ADCA of the lung is challenging from both clinical and pathological perspectives (3). Fluid cytology is diagnostic in only 33% of the cases (2, 4) and sufficient additional tissue from an open surgical biopsy is often required for immunohistochemistry and cytogenetic analysis (1).

Current bioinformatics tools recently applied to microarray data have shown utility in predicting both cancer diagnosis (5) and outcome (6). Though these tools are highly accurate, their widespread clinical relevance and applicability are unresolved. The minimum number of predictor genes is not known, and the discrimination function can vary (for the same genes) based on the location and protocol used for sample preparation (5). Profiling with microarrays requires relatively large quantities of RNA, making the process inappropriate for certain applications. Also, it has yet to be determined if these approaches can utilize relatively low-cost and widely available data acquisition platforms such as RT-PCR and still retain significant predictive capabilities. Finally, the major limitation in translating microarray profiling to patient care is that this approach cannot currently be used to diagnose individual samples independently and without comparison to a predictor model generated from samples whose data were acquired on the same platform.

In this study we have explored an alternative approach using gene expression measurements to predict clinical parameters in cancer. Specifically, we have explored the feasibility of a simple, inexpensive test with widespread applicability that utilizes ratios of gene expression levels and rationally chosen thresholds to accurately distinguish between genetically disparate tissues. This approach circumvents many of the problems that prevent the penetration of expression profiling research into the clinical setting. We found that expression ratio-based diagnosis of MPM and lung cancer was similarly accurate compared to standard statistical methods of class discrimination such as linear discrimination analysis (7) and similar models (5) while addressing many of their deficiencies.

Materials and Methods

Tumor tissues. A combined total of 245 discarded MPM and lung ADCA surgical specimens were freshly collected (and snap frozen) from patients who underwent surgery at Brigham and Women's Hospital (BWH) between 1993 and 2001. Lung ADCA tumors consisted of both primary malignancies and metastatic ADCAs of breast and colon origin. All MPM samples used in these studies contained relatively pure tumor (greater than 50% tumor cells in a high power field examined in a section adjacent to the tissue used). Linked clinical and pathological data were obtained for all patients who contributed tumor specimens and rendered anonymous to protect patient confidentiality. Studies utilizing human tissues were approved by and conducted in accordance with the policies of the Institutional Review Board at BWH.

Microarray experiments. Total RNA (7 μg) was prepared from whole tumor blocks using Trizol Reagent (Invitrogen Life Technologies, Carlsbad, Calif.) and processed as described (8-10). cRNA was hybridized to human U95A oligonucleotide probe arrays (Affymetrix, Santa Clara, Calif.) using a protocol described previously (10). Data from 64 of 245 samples were discarded after visual inspection of hybridization data revealed obvious scanning artifacts, leaving a total of 31 MPM samples and 150 ADCA samples (139 patient tumors and 11 duplicates). Microarrays for all ADCA samples and 12 MPM samples were processed at the Dana-Farber Cancer Institute and the Whitehead Institute. The remaining 19 MPM samples were processed separately at BWH. Microarray data for the ADCA samples have been previously published (11). Bhattacharjee and colleagues used microarray data from ADCAs utilized in this study, in combination with additional samples but not MPM, to identify distinct subclasses within ADCA of the lung and to search for prognostic markers. However, their study did not provide any comparison of gene expression between ADCA and MPM.

Real time quantitative RT-PCR. Total RNA (2 μg) was reverse-transcribed into cDNA using Taq-Man Reverse Transcription reagents (Applied Biosystems, Foster City, Calif.) and quantified using all recommended controls for SYBR Green-based detection. Primers amplifying portions of claudin-7, VAC-β, TACSTD1, and calretinin cDNA (synthesized by Invitrogen Life Technologies) had the following sequences (forward and reverse):

claudin-7: 5′-GTTCCTGTCCTGGGAATGAG-3′ (SEQ ID NO: 87) and 5′-AAGGAGATCCCAGGTCACAC-3′ (SEQ ID NO: 88);
VAC-β: 5′-CCAGCCTTTCGGTCTTCTAT-3′ (SEQ ID NO: 89) and 5′-CTGGAGGAAGTTGGGAAGAG-3′ (SEQ ID NO: 90);
TACSTD1: 5′-AGCAGCTTGAAACTGGCTTT-3′ (SEQ ID NO: 91) and 5′-AACGATGGAGTCCAAGTTCTG-3′ (SEQ ID NO: 92); and
calretinin: 5′-AGGACCTGGAGATTGTGCTC-3′ (SEQ ID NO: 93) and 5′-GAGTCTGGGTAGACGCATCA-3′ (SEQ ID NO: 94).

Data analysis. Gene expression levels were appropriately scaled to facilitate comparison of data from arrays hybridized at different times and/or using multiple scanners. When the “average difference” was negative (i.e. negligible expression level), the absolute value was used. A two-tailed Student's t-test was used to compare the log (gene expression levels) for all 12,600 genes on the microarray between samples from a training set consisting of 16 MPM and 16 ADCA samples. Differences in the mean log (expression levels) between the samples in the two groups in the training set were considered statistically significant if P<2×10⁻⁶. Statistical comparisons (including linear discrimination analysis) were performed using S-PLUS (12). To generate the graphical representations of relative gene expression levels, all expression levels were first normalized within samples by setting the average (mean) to 0 and the standard deviation to 1. Scaled levels were assigned RGB values (representing 20 shades) for colorimetric display as a spectrum representing relative gene expression levels.
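The transformations described in this paragraph can be illustrated with a minimal sketch. It is not the S-PLUS code used in the study. It assumes a base-10 logarithm, a pooled-variance (Student) t statistic, the population standard deviation for within-sample scaling, and a clipping range of ±3 for the 20 display shades; none of these details is stated in the text, and all function names and input values are hypothetical. Converting the t statistic to a P value would additionally require a t-distribution function from a statistics package.

```python
import math
from statistics import mean, pstdev, variance


def log_level(avg_diff):
    """Log10 of an expression value; a negative value is replaced by its absolute value.

    Assumes the value is non-zero (log of zero is undefined).
    """
    return math.log10(abs(avg_diff))


def pooled_t(group_a, group_b):
    """Student's t statistic (pooled variance) for two groups of log expression levels."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))


def normalize_within_sample(levels):
    """Rescale one sample's expression levels to mean 0 and standard deviation 1."""
    mu = mean(levels)
    sigma = pstdev(levels)  # population SD is an assumption
    if sigma == 0:
        raise ValueError("all expression levels are identical; cannot normalize")
    return [(x - mu) / sigma for x in levels]


def shade_index(z, n_shades=20, clip=3.0):
    """Map a normalized level to one of n_shades bins (0 = lowest) after clipping."""
    z = max(-clip, min(clip, z))
    frac = (z + clip) / (2 * clip)
    return min(n_shades - 1, int(frac * n_shades))


# Hypothetical "average difference" values for one sample (not study data).
sample = [120.0, 850.0, 40.0, 2300.0, 610.0]
scaled = normalize_within_sample(sample)
shades = [shade_index(z) for z in scaled]
```

Each shade index would then be mapped to an RGB value of the chosen color spectrum.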

Results

Identification of Diagnostic Molecular Markers. We searched all of the genes represented on the microarray for those with a highly significant difference (P<2×10⁻⁶, ≥8-fold) in average expression levels between both tumor types in the training set of 16 ADCA and 16 MPM samples. For further analysis, we chose the 8 genes with the most statistically significant differences and a mean expression level >600 in at least one of the two training sample sets (gene name, GenBank Accession #): calretinin, X56667, (P=8×10⁻¹²), VAC-β, X16662, (P=8×10⁻¹³), TACSTD1, M93036, (P=6×10⁻¹²), claudin-7, AJ011497, (P=2×10⁻⁹), thyroid transcription factor-1 (TITF-1), U43203, (P=10⁻⁹), MRC OX-2 antigen, X05323, (P=5×10⁻¹³), prostacyclin synthase (PTGIS), D83402, (P=10⁻¹⁰), and hypothetical protein KIAA0977, AB023194, (P=9×10⁻¹¹). Five of these genes were expressed at relatively higher levels in MPM tumors (calretinin, VAC-β, MRC OX-2, PTGIS, and KIAA0977) and 3 were expressed at relatively higher levels in ADCA tumors (TACSTD1, claudin-7, and TITF-1). We then investigated whether expression patterns of these genes extended to all samples (FIG. 1A).

Diagnostic Accuracy of Gene Expression Ratios. Using the 8 genes identified in the initial training set, we calculated 15 expression ratios per sample by dividing the expression value of each of the 5 genes expressed at relatively higher levels in MPM by the expression value of each of the 3 genes expressed at relatively higher levels in ADCA. Then, we tested the diagnostic accuracy of these ratios in the 149 remaining samples not included in the training set (i.e. 15 MPM and 134 ADCA). Samples with ratio values >1 were called MPM and those with ratio values <1 were called ADCA. We found that these ratios could be used to correctly distinguish between ADCA and MPM tumors with a high degree of accuracy (Table 1).

TABLE 1 Accuracy of all ratio combinations in predicting tumor diagnosis in test set

              Claudin-7        TACSTD1          TITF-1
Calretinin    97% (145/149)    98% (146/149)    91% (136/149)
VAC-β         97% (144/149)    97% (145/149)    94% (140/149)
MRC OX-2      97% (145/149)    97% (145/149)    95% (142/149)
KIAA0977      97% (145/149)    95% (142/149)    94% (140/149)
PTGIS         97% (145/149)    97% (144/149)    96% (143/149)

Eight candidate diagnostic genes were identified in a training set of samples as described in the Methods. A total of fifteen possible expression ratios (column/row intersection) were calculated where both genes used to form the ratio possessed inversely correlated expression levels in both tumor types. The accuracy of each ratio in predicting diagnosis was examined in the 149 remaining tumor specimens not included in the training set (15 mesothelioma and 134 adenocarcinoma). Predictions are stated as the fraction diagnosed correctly.
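As an illustration of the single-ratio test, the following minimal sketch forms the 15 ratios for one sample and applies the threshold of 1. The expression values are hypothetical and are not data from the study; the function names are ours, and treating a ratio of exactly 1 as a no-call is an assumption, since the text addresses only values above or below 1.

```python
# Genes reported as higher in MPM (numerators) and higher in ADCA (denominators).
MPM_HIGH = ["calretinin", "VAC-β", "MRC OX-2", "PTGIS", "KIAA0977"]
ADCA_HIGH = ["claudin-7", "TACSTD1", "TITF-1"]


def expression_ratios(levels):
    """Return all 15 (MPM-high gene / ADCA-high gene) ratios for one sample."""
    return {(num, den): levels[num] / levels[den]
            for num in MPM_HIGH for den in ADCA_HIGH}


def call_from_ratio(ratio, threshold=1.0):
    """Call MPM for ratios above the threshold and ADCA for ratios below it."""
    if ratio > threshold:
        return "MPM"
    if ratio < threshold:
        return "ADCA"
    return "no-call"  # assumption: the text does not cover a ratio of exactly 1


# Hypothetical expression values for a single sample.
sample = {
    "calretinin": 1800.0, "VAC-β": 950.0, "MRC OX-2": 700.0,
    "PTGIS": 640.0, "KIAA0977": 820.0,
    "claudin-7": 150.0, "TACSTD1": 90.0, "TITF-1": 60.0,
}
ratios = expression_ratios(sample)
calls = {pair: call_from_ratio(r) for pair, r in ratios.items()}
```

Because each call depends only on two measurements from the same sample, no other samples are needed at the time of diagnosis.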

To incorporate data from multiple ratios, we then randomly chose a pair of independent ratios (calretinin/claudin-7 and VAC-β/TACSTD1) and examined their predictive accuracy in the test set. Each ratio (calretinin/claudin-7 and VAC-β/TACSTD1) was 97% (145/149) accurate with 4 errors (FIGS. 1B and 1C). Thus, a total of 8 samples were incorrectly diagnosed using either ratio. However, these two ratios were in disagreement for all 8 incorrectly diagnosed samples (FIG. 1C). When the diagnostic calls of both ratios are combined, the final analysis results in 95% (141/149) of tumors correctly diagnosed with 0 errors and 8 no-calls. No-calls were conservatively made for samples when both ratios did not return the same diagnosis (FIG. 1C). To predict a diagnosis for the 8 no-calls, we randomly chose an additional ratio (MRC OX-2/TITF-1, Table 1). The addition of a third ratio established a majority diagnosis for the 8 samples that could not previously be determined using only two ratios. Using all 3 ratios (i.e. 6 genes), 99% (148/149) of tumors were correctly diagnosed; 7 no-calls were resolved and 1 sample was incorrectly diagnosed.
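The combination rule described above (agreement of two ratios, otherwise a no-call that a third ratio can resolve by majority) can be sketched as follows. The function name is hypothetical, and the tie-handling shown is one straightforward reading of the text rather than a definitive implementation.

```python
from collections import Counter


def combine_calls(calls):
    """Return the majority diagnosis among per-ratio calls, or "no-call" on a tie."""
    counts = Counter(calls).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "no-call"  # e.g. two ratios that disagree
    return counts[0][0]


# Two ratios that disagree give a no-call; a third ratio settles the diagnosis.
two_ratio_result = combine_calls(["MPM", "ADCA"])
three_ratio_result = combine_calls(["MPM", "ADCA", "ADCA"])
```

With two ratios this reduces to requiring agreement, and with three ratios it reduces to the "majority rules" approach.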

Comparison with Linear Discrimination Analysis. Standard statistical methods of class discrimination (7), such as linear discrimination analysis, can also be used to achieve similar results for these three pairs of genes. We first determined a linear combination of measured expression levels for each pair of genes that provided maximal discrimination between the two sets of tumor samples in the training set. When applied to the test set samples, the linear discrimination functions for the (calretinin, claudin-7), (VAC-β, TACSTD1), and (MRC OX-2, TITF-1) pairs gave 6, 5, and 4 misclassifications, respectively. However, only one sample was incorrectly diagnosed in all three tests combined. In fact, the same errant sample was identified in the application of both the three ratio tests and the three linear discriminant tests. This sample was originally obtained from a patient with the clinical and pathological diagnosis of ADCA. This specimen was annotated by a pathologist reviewing frozen sections of all specimens prior to RNA preparation as having unusual histological features raising suspicion of a “germ cell tumor or sarcoma”.
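For comparison, a two-gene linear discriminant of the kind described above can be sketched with Fisher's construction. This is not the S-PLUS analysis used in the study; the midpoint cutoff, the function names, and the log expression values are assumptions chosen only for illustration.

```python
from statistics import mean


def fisher_discriminant(class_a, class_b):
    """Fit a two-feature Fisher discriminant.

    Returns (w, c): a point x is assigned to class_a when w[0]*x[0] + w[1]*x[1] > c.
    """
    def centroid(points):
        return (mean(p[0] for p in points), mean(p[1] for p in points))

    def scatter(points, m):
        sxx = sum((p[0] - m[0]) ** 2 for p in points)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
        syy = sum((p[1] - m[1]) ** 2 for p in points)
        return sxx, sxy, syy

    m_a, m_b = centroid(class_a), centroid(class_b)
    s_a, s_b = scatter(class_a, m_a), scatter(class_b, m_b)
    sxx, sxy, syy = (s_a[i] + s_b[i] for i in range(3))  # pooled within-class scatter
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    dx, dy = m_a[0] - m_b[0], m_a[1] - m_b[1]
    # w = S^-1 (m_a - m_b), using the closed-form inverse of a 2x2 matrix
    w = ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)
    # cutoff halfway between the projected class centroids (an assumption)
    c = 0.5 * (w[0] * (m_a[0] + m_b[0]) + w[1] * (m_a[1] + m_b[1]))
    return w, c


def classify(point, w, c, label_a="MPM", label_b="ADCA"):
    """Assign a sample to label_a or label_b from its projected score."""
    score = w[0] * point[0] + w[1] * point[1]
    return label_a if score > c else label_b


# Hypothetical log10 expression pairs (calretinin, claudin-7) for a training set.
mpm_train = [(3.2, 1.9), (3.0, 2.1), (3.4, 2.0), (3.1, 1.7)]
adca_train = [(1.8, 3.1), (2.0, 3.3), (1.7, 2.9), (2.2, 3.2)]
w, c = fisher_discriminant(mpm_train, adca_train)
```

Unlike a simple ratio, the weights and cutoff here must be estimated from training samples measured on the same platform as the unknown sample.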

Verification of Microarray Data and Validation of Ratio-Based Diagnosis. We utilized real time quantitative RT-PCR (i) to confirm gene expression levels of diagnostic molecular markers identified in microarray-based analysis and (ii) to demonstrate that ratio-based diagnosis of MPM and lung cancer is equally accurate using data obtained from another platform. We randomly chose 12 tumor samples each of MPM and ADCA from those used in microarray analysis and then calculated expression ratios for calretinin/claudin-7 and VAC-β/TACSTD1. Expression ratios correctly diagnosed 96% (23/24) of samples, with 0 errors and 1 no-call (FIG. 2).

We have also explored the usefulness of expression ratios in predicting clinical parameters under more challenging circumstances, i.e. when predictor genes have substantially higher P values and smaller fold-change differences in average expression levels. In this analysis we used previously published microarray data (6) for a set of 60 medulloblastoma tumors with linked clinical data (Dataset “C”) to create a ratio-based test designed to predict patient outcome after treatment. Of these 60 samples, 39 and 21 originated from patients classified as “treatment responders” and “treatment failures”, respectively. We used a training set composed of 20 randomly chosen samples (10 responders and 10 failures) to identify predictor genes. A total of 10 genes fit our filtering criteria (P<0.05, >2-fold change in average expression levels, at least one mean >200) and we chose the most significant three genes expressed at relatively higher levels in each group for further analysis (gene name, GenBank Accession #): histone 2A, M37583, (P=0.012), GTPase rho C, L25081, (P=0.026), protein gene product 9.5, X04741, (P=0.046), neurofilament-66, S78296, (P=0.0025), sulfonylurea receptor, U63455, (P=0.0067), cell surface protein HCAR, U90716, (P=0.030). Using the previously stated diagnostic criteria, we calculated a total of 9 possible expression ratios using data from these 6 genes and examined their predictive accuracy in the remaining samples (i.e., the test set, n=29 responders and n=11 failures). A total of 5 ratios were equally accurate (75%, 30/40) in predicting test set samples and, in combination, utilized all 6 predictor genes. Our accuracy rate in a true test set of samples is similar to that reported by Pomeroy and colleagues (78%, 47/60) using all 60 samples to develop an 8-gene k-nearest neighbor predictor model (6).
To incorporate the predictive accuracy of multiple ratios (and genes), we calculated the geometric mean of these 5 ratios to give equal weight to ratios with identical magnitude but opposite direction. Finally, we performed Kaplan-Meier survival analysis using predictions made from the geometric mean value. We found that a 6-gene (5-ratio) model could significantly (P=0.00357, log-rank test) predict patient outcome after treatment in the test set of samples (FIG. 3). This P value is moderately lower than that reported by Pomeroy et al. (P=0.009) using all 60 samples to assess their 8-gene k-nearest neighbor predictor model (6). There was no overlap in the list of genes comprising our model and that of these investigators, suggesting that multiple genes are present in this malignancy that have similar predictive capability.
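The reason for using the geometric rather than the arithmetic mean can be shown with a short sketch: a 2-fold ratio in one direction and a 2-fold ratio in the other cancel exactly under the geometric mean. The function names, the threshold of 1, and the generic group labels below are assumptions for illustration; the text does not state which outcome group supplied the numerator genes.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive expression ratios, computed via logarithms."""
    if any(r <= 0 for r in ratios):
        raise ValueError("expression ratios must be positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def predict_group(ratios, threshold=1.0):
    """Assign a sample by comparing the geometric mean of its ratios to a threshold."""
    gm = geometric_mean(ratios)
    return "numerator group" if gm > threshold else "denominator group"


# Ratios of equal magnitude but opposite direction offset each other.
balanced = geometric_mean([2.0, 0.5])       # approximately 1
arithmetic = (2.0 + 0.5) / 2                # 1.25, biased toward the larger ratio
prediction = predict_group([3.0, 1.5, 0.8, 2.2, 1.1])  # hypothetical ratios
```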

Discussion

Accurate diagnosis of cancer (or any disease) is the first critical step in choosing appropriate treatments that will hopefully result in the best possible outcome. We propose that the ratio-based method described herein that utilizes expression levels of carefully chosen genes can be a simple, inexpensive, and highly accurate means to distinguish MPM from ADCA of the lung and that this method is applicable to many other clinical scenarios. We have also shown that multiple highly accurate ratios can be combined to form a simple diagnostic tool using the ratio direction (“majority rules” approach, e.g., MPM and lung cancer diagnosis) or the ratio magnitude (calculation of the geometric mean, e.g., prediction of outcome in medulloblastoma). The gene expression ratio method, by virtue of the fact that it is a ratio, (i) negates the need for a third reference gene when determining expression levels, (ii) is independent of the platform used for data acquisition, (iii) requires only small quantities of RNA (as little as 10 pg using RT-PCR), (iv) does not require the coupling of transcription to translation for chosen genes, and (v) permits analysis of individual samples without reference to additional “training samples” whose data were acquired on the same platform. For these reasons, expression ratios are more likely to find immediate use in clinical settings since they confer several advantages compared to other equally accurate techniques, such as linear discriminant analysis.

The small P values and large fold-differences in average expression levels between genes used in expression ratio-based diagnosis of MPM and lung cancer are not surprising given that both tumor types have different cell types of origin. It is important to note that we have not determined in the current study the exact magnitude and consistency by which gene expression needs to differ between any two groups to allow the usage of a simple ratio test. In other clinical scenarios the differences in gene expression patterns between groups to be distinguished may be more subtle, thus necessitating relaxed filtering criteria in choosing potential predictor genes. Even in these cases, simple ratios can still be a highly accurate means of predicting clinical parameters. We have also found that expression ratios are useful in predicting outcome after therapy in MPM using genes with considerably higher P values and lower fold-differences in average expression levels than those used in the current study (Gordon et al., manuscript submitted). In the current study, we have used previously published microarray data (6) to identify a small number of predictor genes that were able to significantly predict outcome after therapy in medulloblastoma in a true test set of samples using simple expression ratios. Nevertheless, in some cases larger numbers of genes (and perhaps sophisticated software) and/or initial expression profiling of a larger number of specimens for the training set may be required to achieve acceptable predictive power.

The selection of diagnostic genes for MPM and lung cancer was based solely on our stated criteria. Nevertheless, many of the molecular markers with the lowest P values and greatest difference in average expression levels have notable cancer relevance and/or are known to have tissue specific expression patterns. Calretinin (13, 14) and TITF-1 (15, 16) are part of several immunohistochemical panels currently used in the diagnosis of MPM and lung cancer. Claudin family members are expressed in various cancers (17, 18) and TACSTD1 (alias TROP1) is a recently described marker for carcinoma cells and, as a cell surface receptor protein, has been postulated to play a role in growth regulation of tumor cells (19, 20). The discovery of diagnostic gene ratios is likely to make possible future clinical tests to definitively diagnose MPM and ADCA using smaller tissue specimens and perhaps pleural effusions. In this way the need for diagnostic surgery in many of these patients may be eliminated.

The expression ratio technique represents a substantial improvement over past efforts to translate the strengths of expression profiling into simple tests with clinical relevancy. Many bioinformatics tools under development and testing are quite complex and/or rely upon data from large numbers of “training samples” to establish a diagnosis for unknown samples. The end result is that the practical use of microarray data remains beyond the scope of many scientists and clinicians. Similarly, no comprehensive method has been proposed to translate the results of tumor profiling to the analysis of individual tissues. As a consequence, no simple yet effective clinical applications have resulted from microarray research. The expression ratio technique represents a powerful use of microarray data that can be easily adapted and extended to routine clinical application without the need for additional sophisticated analysis.

REFERENCES FOR EXAMPLE 1

1. Aisner, J. Diagnosis, staging, and natural history of pleural mesothelioma. In: J. Aisner, R. Arriagada, M. R. Green, N. Martini, and M. C. Perry (eds.), Comprehensive Textbook of Thoracic Oncology, pp. 799-785. Baltimore: Williams and Wilkins, 1996.
2. Pass, H. Malignant pleural mesothelioma: Surgical roles and novel therapies, Clin Lung Cancer. 3: 102-117, 2001.
3. Ordonez, N. G. The immunohistochemical diagnosis of epithelial mesothelioma, Hum Pathol. 30: 313-323, 1999.
4. Nguyen, G.-K., Akin, M.-R. M., Villanueva, R. R., and Slatnik, J. Cytopathology of malignant mesothelioma of the pleura in fine-needle aspiration biopsy, Diagn Cytopathol. 21: 253-259, 1999.
5. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science. 286: 531-537, 1999.
6. Pomeroy, S. L., Tamayo, P., Gaasenbeek, M., Sturla, L. M., Angelo, M., McLaughlin, M. E., Kim, J. Y. H., Goumnerova, L. C., Black, P. M., Lau, C., Allen, J. C., Zagzag, D., Olson, J. M., Curran, T., Wetmore, C., Biegel, J. A., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D. N., Mesirov, J. P., Lander, E. S., and Golub, T. R. Prediction of central nervous system embryonal tumor outcome based on gene expression, Nature. 415: 436-442, 2002.
7. Dudoit, S., Fridlyand, J., and Speed, T. P. Comparison of discrimination methods for the classification of tumors using gene expression data, J Am Stat Assoc. In Press, 2002.
8. Wang, K., Gan, L., Jeffery, E., Gayle, M., Gown, A. M., Skelly, M., Nelson, P. S., Ng, W. V., Schummer, M., Hood, L., and Mulligan, J. Monitoring gene expression profile changes in ovarian carcinomas using cDNA microarrays, Gene. 229: 101-108, 1999.
9. Warrington, J. A., Nair, A., Mahadevappa, M., and Tsyganskaya, M. Comparison of human adult and fetal expression and identification of 535 housekeeping/maintenance genes, Physiol Genomics. 2: 143-147, 2000.
10. O'Dell, S. D., Bujac, S. R., Miller, G. J., and Day, I. N. Associations of IGF2 ApaI RFLP and INS VNTR class I allele size with obesity, Eur J Hum Genet. 7: 565-576, 1999.
11. Bhattacharjee, A., Richards, W. G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E. J., Lander, E. S., Wong, W., Johnson, B. E., Golub, T. R., Sugarbaker, D. J., and Meyerson, M. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma sub-classes, Proc Natl Acad Sci USA. 98: 13790-13795, 2001.
12. Venables, W. N. and Ripley, B. D. Modern Applied Statistics with S-Plus. New York: Springer, 1997.
13. Lozano, M. D., Panizo, A., Toledo, G. R., Sola, J. J., and Pardo-Mindan, J. Immunocytochemistry in the differential diagnosis of serous effusions: a comparative evaluation of eight monoclonal antibodies in Papanicolaou stained smears, Cancer. 93: 68-72, 2001.
14. Sato, S., Okamoto, S., Ito, K., Konno, R., and Yajima, A. Differential diagnosis of mesothelial and ovarian cancer cells in ascites by immunocytochemistry using Ber-EP4 and calretinin, Acta Cytol. 44: 485-488, 2000.
15. Di Loreto, C., Puglisi, F., Di Lauro, V., Damante, G., and Beltrami, C. A. TTF-1 protein expression in malignant pleural mesotheliomas and adenocarcinomas of the lung, Cancer Lett. 124: 73-78, 1998.
16. Ordonez, N. G. The value of antibodies 44-36A, SM3, HBME-1, and thrombomodulin in differentiating epithelial pleural mesothelioma from lung adenocarcinoma, Am J Surg Pathol. 21: 1399-1408, 1997.
17. Michl, P., Buchholz, M., Rolke, M., Kunsch, S., Lohr, M., McClane, B., Tsukita, S., Leder, G., Adler, G., and Gress, T. M. Claudin-4: a new target for pancreatic cancer treatment using Clostridium perfringens enterotoxin, Gastroenterology. 121: 678-684, 2001.
18. Hough, C. D., Sherman-Baust, C. A., Pizer, E. S., Montz, F. J., Im, D. D., Rosenshein, N. B., Cho, K. R., Riggins, G. J., and Morin, P. J. Large-scale serial analysis of gene expression reveals genes differentially expressed in ovarian cancer, Cancer Res. 60: 6281-6287, 2000.
19. Alberti, S., Nutini, M., and Herzenberg, L. A. DNA methylation prevents the amplification of TROP 1, a tumor-associated cell surface antigen gene, Proc Natl Acad Sci USA. 91: 5833-5837, 1994.
20. Calabrese, G., Crescenzi, C., Morizio, E., Palka, G., Guerra, E., and Alberti, S. Assignment of TACSTD1 (alias TROP1, M4S1) to human chromosome 2p21 and the refinement of mapping of TACSTD2 (alias TROP2, M1S1) to human chromosome 1p32 by in situ hybridization, Cytogenet Cell Genet. 92: 164-165, 2001.

Example 2

Molecular Markers for Malignant Pleural Mesothelioma

Introduction

In this study we have refined a gene expression measurement approach to predict clinical parameters in cancer, including distinguishing between subclasses of malignant pleural mesothelioma and distinguishing between malignant pleural mesothelioma and lung adenocarcinoma. We have found that ratios of gene expression levels can accurately distinguish between genetically disparate tissues.

Methods

MPM tissues. Discarded malignant pleural mesothelioma (MPM) surgical specimens were freshly collected from patients undergoing pleurectomy or extrapleural pneumonectomy at Brigham and Women's Hospital (Boston, Mass.) from 1992 to 1998 and flash frozen. All tissues were obtained from patients who did not receive pre-operative treatment. Standard tissue banking procedures were followed throughout. Once brought to the Hospital Tumor Bank, tissues were sliced into 3 mm³ portions, and each portion was assigned an identifier to catalogue its position in the original specimen. Hematoxylin-stained slides were generated from each MPM specimen in the Tumor Bank and reviewed by a pathologist for tumor content and histological subtype. A total of 80 specimens have been identified to date that contain relatively pure tumor (greater than 50% of cells in a high power field are tumor cells). Of these, 24 were chosen for microarray analysis. Linked clinical, epidemiological, outcome, and pathological data were obtained for all patients who contributed tumor specimens and rendered anonymous to protect patient confidentiality. Studies utilizing human tissues were approved by and conducted in accordance with the policies of the Institutional Review Board at Brigham and Women's Hospital.

Tissue processing and RNA preparation. Total RNA was isolated from frozen tumor blocks using Trizol solution (Invitrogen Life Technologies, Carlsbad, Calif.) exactly per the manufacturer's recommended protocol. To initially assess total RNA degradation, a portion of RNA from each sample was resolved on a 1% agarose/formaldehyde gel using standard procedures (Ausubel, 1998). Gels were stained with ethidium bromide and bands representing ribosomal subunits 28S and 18S were visualized. Approximately 10% of samples collected were discarded secondary to unsatisfactory quality.

Total RNA (7 μg) was amplified and the product labeled with biotin following a procedure previously described (Wang, 1999; Warrington, 2000; O'Dell, 1999). Briefly, double-stranded cDNA was synthesized using the SuperScript Choice System (Invitrogen Life Technologies) and a T7-(dT)-24 first strand primer (Geneset Oligos, La Jolla, Calif.). The cDNA was purified by phenol/chloroform/isoamyl alcohol extraction using a phase lock gel (5 Prime-3 Prime, Inc., Boulder, Colo.) and concentrated by ethanol precipitation. In vitro transcription was performed to produce biotin-labeled cRNA using a BioArray High Yield RNA Transcript Labeling Kit (Affymetrix, Santa Clara, Calif.) according to the manufacturer's instructions. Linearly amplified cRNA was obtained by incubation with T7 RNA polymerase. Final cRNA preparations were cleaned with RNeasy Mini kit (Qiagen, Valencia, Calif.).

Hybridization of RNA to high density oligonucleotide microarrays. Prior to hybridization to experimental arrays, the quality of cRNA was assessed for approximately half of all samples using test arrays (Affymetrix Test2 gene arrays) designed to compare relative expression levels of β-actin and GAPDH by using oligonucleotide probes complementary to both the 3′ and 5′ ends of gene products. Hybridization of test arrays was done as detailed below for experimental arrays with minor modifications as suggested by the manufacturer (Affymetrix). Biotinylated cRNA (20 μg) was fragmented and hybridized to microarrays containing oligonucleotide probe-sets representing approximately 12,000 known human genes (Affymetrix U95A human array, HG-95Av2) according to Affymetrix protocols as described previously (O'Dell, 1999). Essentially, the hybridization mixture was incubated at 99° C. for 5 min. followed by incubation at 45° C. for 5 min. before injection of the sample into the probe array cartridge. Hybridization was performed at 45° C. for 16-18 hours. After washing, the array was stained with streptavidin-phycoerythrin (Molecular Probes, Eugene, Oreg.) and the hybridization signal amplified using a biotinylated anti-streptavidin antibody (Vector Laboratories, Inc., Burlingame, Calif.) before subsequent scanning in a HP GeneArray scanner (Affymetrix).

The intensities of all microarray features were captured and examined for artifacts using Affymetrix GeneChip® Software v. 4.0, according to standard Affymetrix procedures (O'Dell, 1999). The “target intensity” was set to 100 for all samples. Each array contained several prokaryotic genes which served as internal hybridization controls for RNA spiked into experimental samples. Data from 5 arrays were uninterpretable and were discarded, leaving a total of 19 samples in the final analysis. Of these 19, 2 were tested in duplicate and 1 in triplicate. GeneChip® Software was used to generate quantitative gene expression values (measured by average differences).

Real time quantitative RT-PCR. Gene expression data obtained from microarrays were verified using real time quantitative RT-PCR. PCR reactions were set up, optimized, and performed precisely following the manufacturer's recommended protocol (Sequence Detection System, Applied Biosystems, Foster City, Calif.). Total RNA (2 μg) was reverse-transcribed into cDNA using Taq-Man Reverse Transcription reagents and random hexamers as the primer (Applied Biosystems). PCR reactions were set up in a 25 μl reaction volume using SYBR Green PCR Master Mix (Applied Biosystems). Optimized primers amplifying portions of fibronectin, transgelin, complement factor B (CFB), and L32 ribosomal protein cDNA were designed according to recommended specifications (Applied Biosystems), synthesized by Invitrogen Life Technologies, and used at a final concentration of 900 nM in the reaction mixture.

Primer sequences were as follows:

fibronectin: 5′-GCCATGACAATGGTGTGAAC-3′ (SEQ ID NO: 1) and 5′-GCAAATGGCACCGAGATATT-3′ (SEQ ID NO: 2);
transgelin: 5′-AGGACTCTGGGGTCATCAAG-3′ (SEQ ID NO: 3) and 5′-AGTTGGGATCTCCACGGTAG-3′ (SEQ ID NO: 4);
CFB: 5′-TGAGGCTTCCTCCAACTACC-3′ (SEQ ID NO: 5) and 5′-TGCCTTTCTTATCCCCATTC-3′ (SEQ ID NO: 6);
L32: 5′-AACCCAGAGGGATTGACAAC-3′ (SEQ ID NO: 7) and 5′-ACTTCCAGCTCCTTGACGTT-3′ (SEQ ID NO: 8).

PCR amplification was performed in a 96-well format using optical plates and covers (Applied Biosystems) in an Applied Biosystems 5700 Sequence Detector. To confirm the absence of non-specific amplification in PCR reactions, no-template controls containing H₂O substituted for template were run in multiple wells on every reaction plate. In addition, a melting-point dissociation curve was automatically generated after every experiment to confirm the presence of a single PCR species in all experimental wells. The Comparative C_(T) method was used to obtain quantitative values for gene expression levels in all samples (Applied Biosystems, see http://www.appliedbiosystems.com for details). This method normalizes expression levels between samples using another (housekeeping) gene as a reference to standardize for different starting template amounts.

Briefly, amplification reactions are characterized by the point in time during cycling when amplification of a PCR product is first detected rather than by the amount of PCR product accumulated after a fixed number of cycles. In the initial cycles of PCR, there is little change in fluorescence signal, which defines the baseline for the amplification plot. An increase in fluorescence above the baseline indicates the detection of accumulated PCR product. A fixed fluorescence threshold can be set above the baseline. The parameter C_(T) (threshold cycle) is defined as the fractional cycle number at which the fluorescence passes the fixed threshold. The higher the starting copy number of the nucleic acid target, the sooner an increase in fluorescence past the selected threshold is observed. A plot of the log of initial target copy number for a set of standards versus C_(T) is a straight line. Therefore, quantification of the amount of target in unknown samples is accomplished by measuring C_(T) and using the standard curve to determine starting copy number. The L32 ribosomal gene was used as the reference (housekeeping) gene for normalization since its expression levels did not vary substantially over all samples (from microarray data).
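The normalization to a housekeeping gene described above can be illustrated with a minimal sketch of the widely used 2^(-ΔΔC_T) form of the comparative C_(T) calculation. The sketch assumes that amplification efficiency is close to 100% (product roughly doubles each cycle) and that expression is reported relative to a calibrator sample; neither assumption is stated in the text. The function name and all C_(T) values are hypothetical.

```python
def relative_expression(ct_target, ct_reference, ct_target_cal, ct_reference_cal):
    """Fold change of a target gene relative to a calibrator sample.

    Assumes roughly 100% amplification efficiency (product doubles per cycle).
    ct_reference is the C_T of the housekeeping gene (e.g., L32) in the same sample.
    """
    # Normalize each sample to its own reference gene
    delta_ct_sample = ct_target - ct_reference
    delta_ct_calibrator = ct_target_cal - ct_reference_cal
    # Compare the normalized sample with the normalized calibrator
    delta_delta_ct = delta_ct_sample - delta_ct_calibrator
    return 2.0 ** (-delta_delta_ct)


# Hypothetical C_T values, for illustration only
fold = relative_expression(ct_target=24.0, ct_reference=18.0,
                           ct_target_cal=26.0, ct_reference_cal=18.0)
print(fold)  # 4.0: the target crosses the threshold 2 cycles earlier after normalization
```

A lower C_(T) means more starting template, which is why the exponent carries a negative sign.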

Data analysis. A hierarchical clustering algorithm (AGNES) in the statistical package S-PLUS (Venables, 1997) was used to classify all 19 MPM tumors according to relative variation in gene expression patterns. All linked clinical data were held exclusively by one investigator and revealed only after cluster analysis was completed. Gene hybridization intensities (from GeneChip® Software) were appropriately scaled to a “target intensity” of 100 to facilitate comparison of data from all arrays. To minimize contamination from signal background and saturation effects (Hsiao, 2001), only genes with an expression value between 1,000 and 5,000 were considered in the unsupervised cluster analysis.
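The intensity filter applied before clustering can be sketched as follows. The text does not state whether the 1,000 to 5,000 window was applied to every sample or to an average across samples; this sketch assumes every sample, and the function name, gene labels, and values are hypothetical.

```python
def filter_genes(expression, low=1000.0, high=5000.0):
    """Keep genes whose expression lies within [low, high] in every sample.

    expression maps a gene identifier to a list of per-sample values.
    Requiring the window to hold in all samples is an assumption.
    """
    return {
        gene: values
        for gene, values in expression.items()
        if all(low <= v <= high for v in values)
    }


# Hypothetical average-difference values for three samples
expression = {
    "GENE_A": [1200.0, 3400.0, 2100.0],  # inside the window everywhere
    "GENE_B": [150.0, 90.0, 300.0],      # near background
    "GENE_C": [4800.0, 5600.0, 4100.0],  # exceeds the upper bound in one sample
}
kept = filter_genes(expression)
print(sorted(kept))  # ['GENE_A']
```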

The significance of observed differences in gene expression levels between selected MPM subclasses was assessed using a Kruskal-Wallis test (nonparametric ANOVA) followed by Dunn's Multiple Comparison test. A Mann-Whitney test was used for selective pairwise comparisons, such as median patient survival. The degree of correlation between patient survival and matched gene expression levels was examined using Spearman correlation calculations and trendlines were obtained by generating a LOWESS curve. Contingency tables were analyzed using a chi-square test for independence and a chi-square test for trend. All differences were determined to be statistically significant if P<0.05. Calculations and statistical comparisons were generated using GraphPad Prism v.3.02 (GraphPad Software, San Diego, Calif.). To generate the graphical representations of relative gene expression levels (log2 scale), all expression levels were first normalized within samples by setting the average (mean) to 0 and the standard deviation to 1. Scaled levels were assigned RGB values (representing 17 shades) for colorimetric display as a spectrum representing relative gene expression levels.
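The within-sample normalization used for the graphical display (mean set to 0, standard deviation set to 1) can be sketched as below. The use of the population standard deviation is an assumption, since the text does not say which estimator was used; the function name and input values are hypothetical.

```python
import statistics


def standardize(values):
    """Rescale one sample's expression values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD; an assumption
    if sd == 0:
        raise ValueError("cannot standardize a sample with zero variance")
    return [(v - mean) / sd for v in values]


# Hypothetical log2 expression levels for one sample
sample = [8.0, 10.0, 12.0]
scaled = standardize(sample)
print([round(v, 3) for v in scaled])
```

After this step, values from different samples share a common scale, so a single color mapping can be applied across all arrays.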

Results

Identification of MPM subclasses using hierarchical cluster analysis. Variation in the patterns of gene expression levels in 19 MPM tumors was examined using unsupervised cluster analysis, and distinct classes were identified based on similar expression profiles. The dendrogram specifying the grouping of samples shows that MPM tumors segregate into 2 major subclasses. A major division in the distribution of samples separates 6 tumor specimens from all others (designated Subclass 1). The remaining 13 samples form 2 distinct subclasses that cluster tightly on opposite sub-branches (designated Subclasses 2 and 3). A maximum of 145 genes with relatively high expression levels between 1,000 and 5,000 (i.e., average difference) was sufficient to accurately define MPM subclasses. Only a portion of the 145 genes used for analysis is identified here. The complete set is available in Table 6.

A set of 26 genes was identified as providing a redundant set of diagnostic gene expression ratios that can be used in different combinations (Table 2). There is some overlap to ensure complete coverage of samples to correct for “no-calls” for any one ratio.

TABLE 2

Gene elevated in | Accession # | Symbol | Name | SEQ ID NOs:*

Meso/Adeno
adeno | J02761 | SFTPB | surfactant pulmonary associated protein B | 9, 10
adeno | M93036 | TACSTD1 | tumor associated calcium signal transducer 1 | 11, 12
adeno | AJ011497 | CLDN7 | claudin 7 | 13, 14
meso | X56667 | CALB2 | calbindin 2 (calretinin) | 15, 16
meso | X16662 | ANXA8 | annexin 8 | 17, 18
meso | M21389 | KRT5 | keratin 5 | 19, 20

Meso/Normal
meso | X03168 | VTN | vitronectin | 21, 22
meso | X76029 | NMU | neuromedin U | 23, 24
meso | X56667 | CALB2 | calbindin 2 (calretinin) | 15, 16
normal | U43203 | TITF1 | thyroid transcription factor 1 | 25, 26
normal | M18728 | CEACAM6 | carcinoembryonic Ag-related cell adhesion molecule 6 | 27, 28
normal | T92248 | UGB | uteroglobin | 29, 30

Meso/Squamous
meso | AI651806 | LOC51232 | cysteine-rich repeat containing protein S52 precursor | 31, 32
meso | X56667 | CALB2 | calbindin 2 (calretinin) | 15, 16
meso | D83402 | PTGIS | prostacyclin synthase | 33, 34
squa | U42408 | LAD1 | ladinin 1 | 35, 36
squa | L33930 | CD24 | CD24 antigen | 37, 38
squa | AI539439 | S100A2 | S100 calcium-binding protein A2 | 39, 40

Ept/All Other
ept | AL049963 | LOC64116 | up-regulated by BCG-CWS | 41, 42
ept | L15702 | CFB | complement factor B (CFB) | 43, 44
other | M95787 | TAGLN | transgelin | 45, 46

Subclass 1/2 (also see ept/other)
1 | AL049963 | LOC64116 | up-regulated by BCG-CWS | 41, 42
2 | NM_001953 | ECGF1 | endothelial cell growth factor 1 | 47, 48

Subclass 1/3
1 | M22919 | MYL6 | myosin, light polypeptide 6 | 49, 50
3 | X06256 | ITGA5 | integrin, alpha 5 | 51, 52

Subclass 2/3
2 | Z98946 | MSN | moesin | 53, 54
2 | AI540958 | PIN | dynein | 55, 56
3 | AI677689 | KIAA0685 | KIAA0685 gene product | 57, 58
3 | M31932 | FCGR2A | Fc fragment of IgG, low affinity IIa | 59, 60

meso = malignant pleural mesothelioma; adeno = adenocarcinoma; normal = normal lung tissue; squa = squamous carcinoma; ept = epithelial
*SEQ ID NOs are given as nucleotide sequence, amino acid sequence
Genes that serve multiple purposes (and are listed more than once) are italicized. There are a total of 26 genes.

The functional distribution of these reliably expressed genes can be roughly classified as follows: 33% ribosomal, 7% cytoskeletal, 6% inflammatory/immune, 3% extracellular matrix (ECM), 3% intracellular signaling, 2% proliferation, and 46% other, multiple, or unknown function. Approximately two-thirds of the 145 genes were expressed at substantially higher levels in samples from Subclass 1; the remaining one-third were expressed at relatively higher levels in samples from Subclasses 2 and 3. Ribosomal proteins account for approximately 50% of all genes overexpressed in Subclass 1. Genes overexpressed in Subclasses 2 and/or 3 consisted predominantly of cytoskeletal and ECM-related genes such as actin, vimentin, tubulin, myosin, cofilin, osteonectin, and others. The organization of MPM subclasses was extremely robust and reproducible. For example, samples assigned to Subclasses 1, 2, and 3 remain in the same subclass when cluster analysis is repeated incorporating data from 3 samples chosen at random for duplicate (n=2) and triplicate (n=1) hybridization experiments on different microarrays.

Clinical characteristics of MPM subclasses. Linked clinical data for individual samples are presented in tabular format and arranged according to subclass membership (Table 3). Samples in Subclasses 1 and 2 consisted exclusively of specimens histologically classified as epithelial and mixed subtypes, respectively. Subclass 3 consisted of members of all major histological subtypes: epithelial (n=3), mixed (n=4), and sarcomatoid (n=2). Two samples were excluded from survival analysis since they originated from patients whose status was unknown (sample 116, Subclass 1) or who did not die from disease (sample 118, Subclass 2). The analysis of cancer related clinical outcome (using nonparametric ANOVA) revealed that the median survival (19 months) of patients in Subclass 1 (all epithelial subtype) was significantly higher (P<0.01) than the median survival (2 months) of patients in Subclass 2 (all mixed subtype). The median survival of patients with varied histology in Subclass 3 (11 months) was intermediate to Subclasses 1 and 2, but nonetheless was not significantly different (P>0.05) from that of either Subclass. There was no significant difference in survival between patients with epithelial histology classified as either Subclass 1 or 3. However, patients with mixed histology classified as Subclass 2 samples (all “short-lived” mixed) had significantly shorter (P=0.029) median survival (2 months) when compared to the median survival (11.5 months) of patients with mixed histology classified as Subclass 3 (all “long-lived” mixed). Although asbestos count and exposure history appeared to be lower in members of Subclass 1, there are insufficient data at this time to draw meaningful conclusions. Beyond patient survival, there were no other significant associations between samples in MPM subclasses and any other aspect of the clinical data.

TABLE 3

Sample # | Age | Sex | Histology^(a) | Smoking Hx | Pack Years^(b) | Asbestos Exposure Hx | Asbestos Fiber Count^(c) | Survival^(d) | Status^(e)

Group 1
76 | 67 | m | ept. | yes | — | neg. | — | 17 | 3
86 | 42 | f | ept. | no | 0 | neg. | 0 | 9 | 3
116 | 70 | m | ept. | yes | — | neg. | 50 | 9 | U
90 | 48 | m | ept. | no | 0 | pos. | — | 28 | 2
68 | 61 | m | ept. | yes | 30 | neg. | 168 | 21 | 3
109 | 62 | m | ept. | yes | 14 | pos. | — | 19 | 3

Group 2
89 | 55 | m | mixed | yes | 4.5 | pos. | 147 | 3 | 3
118 | 74 | m | mixed | yes | 73.8 | pos. | 119 | 7 | 4
133 | 69 | m | mixed | no | 0 | pos. | 17,547 | 2 | 3
114 | 51 | m | mixed | yes | 3.75 | neg. | 2919 | 2 | 3

Group 3
101 | 71 | f | ept. | yes | 15 | pos. | — | 11 | 3
93 | 52 | m | ept. | yes | 113 | pos. | — | 15 | 3
229 | 33 | f | ept. | no | 0 | pos. | 13 | 5 | 3
105 | 66 | m | mixed | yes | 38 | pos. | 290 | 12 | 3
72 | 46 | m | mixed | yes | 25 | pos. | 6540 | 53 | 3
213 | 55 | m | mixed | yes | 5 | pos. | 266 | 11 | 1
130 | 55 | m | mixed | yes | 10 | pos. | 69 | 6 | 3
159 | 62 | m | sarc. | no | 0 | pos. | — | 2 | 3
166 | 66 | m | sarc. | yes | 30 | pos. | 451 | 6 | 3

—, data unavailable
^(a)ept., epithelial; sarc., sarcomatoid
^(b)packs per day × years smoking
^(c)per gram of lung tissue; control median value ~70
^(d)in months
^(e)1, alive without disease; 2, alive with disease; 3, dead with disease; 4, dead other causes; U, unknown

Identification of prognostic and diagnostic molecular markers for MPM. To identify candidate prognostic molecular markers for MPM, all 12,000 genes on the microarray were searched to find those with expression levels that were significantly different between tumors of Subclass 1 (best prognosis) and Subclass 2 (worst prognosis). Not surprisingly, many of these genes also distinguished samples of epithelial histology from tumors of all other subtypes, and for this reason may further serve as diagnostic molecular markers. Approximately 400 genes on the microarray fit our selection criteria (average expression level >1000 in at least one group), and the 11 most statistically significant genes (P=10⁻³ to 10⁻⁶) are listed in Table 4.

TABLE 4

Accession # | Name | SEQ ID NOs:*
M95787 | SM22 (transgelin) | 45, 46
Z82215 | DNA sequence from PAC 68O2 | 61, 62
X13839 | vascular smooth muscle alpha-actin | 63, 64
L19182 | MAC25 | 65, 66
J04599 | proteoglycan I (biglycan) | 67, 68
X02761 | fibronectin | 69, 70
X15882 | collagen VI | 71, 72
Y14690 | procollagen alpha 2(V) | 73, 74
L15702 | complement factor B (CFB) | 43, 44
N47307 | cDNA, 3 end/clone = IMAGE-280506 | 75, 76
L38941 | ribosomal protein L34 (RPL34) | 77, 78

*SEQ ID NOs are given as nucleotide sequence, amino acid sequence

The fact that these genes also distinguish samples of the epithelial subtype suggests that tumors of the same histological subtype possess certain similarities in gene expression despite being assigned to different classes using hierarchical clustering. Another set of genes that distinguishes between subclass 1 and subclass 2 is presented in Table 7. Of the 11 genes in Table 4, the first 8 are expressed at relatively lower levels and the final three genes are expressed at higher levels in epithelial subtype samples (subclass 1) compared to all others.

From this set, two genes with relatively low levels of expression in epithelial subtype samples (transgelin and fibronectin, P=10⁻⁴ and 0.0028 respectively) and one with relatively high levels of expression in epithelial subtype samples (complement factor B (CFB), P=10⁻⁴) were chosen. Expression levels for all three genes were significantly (P<0.05) correlated with survival irrespective of histological subtype. Furthermore, expression level ratios of transgelin/CFB and fibronectin/CFB were also significantly correlated with survival independent of histology. Levels of transgelin and fibronectin were approximately equal in individual samples; accordingly, the fibronectin/transgelin ratio remains close to 1 for all samples and is not correlated with survival.

When individual patient ratios of either transgelin/CFB or fibronectin/CFB were 10 or greater, survival was lowest (median survival=2 months, range 2-6 months). Conversely, when these ratios were 1 or less, patient survival was substantially higher (median survival=13 months, range 5-28 months). Additionally, 80% (8/10) of patients whose ratio was less than 1 survived at least 9 months.

Another set of genes was identified to distinguish between adenocarcinoma and malignant pleural mesothelioma in accordance with the procedures described above. These genes are presented in Table 8.

Validation of microarray-based analysis of gene expression. Quantitative RT-PCR was utilized to verify gene expression levels of the molecular markers identified in microarray-based analysis. As expected, average expression levels for CFB were significantly higher (P=0.019) and average expression levels for transgelin and fibronectin significantly lower (P=0.038 and P=0.024, respectively) in samples from Subclass 1 (good prognosis) compared to Subclass 2 (poor prognosis) (FIG. 4A).

We then determined whether expression ratios created using microarray data could be accurately reproduced using quantitative RT-PCR data. Ratios were created by expressing individual gene levels in epithelial subtype samples relative to levels in all other subtypes combined (FIG. 4B). We found that ratios created using data from both platforms were in relative agreement for all 3 genes. Expression level ratios (and individual expression levels) obtained from RT-PCR were also significantly correlated with survival for transgelin/CFB (P=0.0015) and fibronectin/CFB (P=0.009) independent of the histological subtype.

Verification of Expression Level Ratios as Prognostic and Diagnostic Molecular Markers.

To verify the prognostic capability of gene expression ratios, quantitative RT-PCR was used to obtain expression level values for transgelin and CFB in 17 additional tumor samples not subjected to microarray analysis. (Two of these samples were omitted because they did not express detectable levels of one or both genes.) Based on the prior analysis of samples using microarrays, we hypothesized for the remaining 15 samples that patients with transgelin/CFB ratios above 1 (n=6) would have generally poor prognosis, and those with ratios below 1 (n=9) would have generally good prognosis. In this case, ratios correctly identified the 3 individuals with the best clinical outcome (20-, 21-, and 51-month survival) and the 3 individuals with the worst clinical outcome (2-, 2-, and 4-month survival).

To increase the sample size for statistical considerations, RT-PCR data from these patients were combined with those from patients whose tumors were subjected to microarray analysis, for a total of 32 samples. In this larger sample set, patient survival was significantly (P=0.0011) correlated with matched values for transgelin/CFB expression ratios. As expected, median patient survival is inversely related to the value of the transgelin/CFB expression ratio.

Next, we formed a contingency table by sorting the number of patient samples with transgelin/CFB expression ratios either above 1 or below 1 into groups representing 5-month survival increments (Table 5). In this case, the prognostic value of the transgelin/CFB expression ratio was again confirmed. Statistical analysis revealed that survival and ratio value were significantly associated (P=0.007) and that this association follows a significant linear trend (P=0.0076). Still, prediction of prognosis was most efficient at either survival extreme (<5 months and >15 months) with 100% of samples from patients with poor survival having ratios >1 and nearly 85% of patients with the longest survival having ratios <1 (Table 5).

TABLE 5

T/C Ratio^(a) | Median Patient Survival^(b)
Greater than 10 | 4 (n = 8)
Greater than 1 | 5 (n = 14)
Less than 1 | 12 (n = 18)
Less than 0.5 | 14 (n = 12)
Less than 0.1 | 17 (n = 4)
All samples | 9 (n = 32)

Survival^(b) | Samples with T/C ratio <1 | Samples with T/C ratio >1
<5 | 0/7 (0%) | 7/7 (100%)
5-10 | 8/11 (73%) | 3/11 (27%)
10-15 | 5/8 (63%) | 3/8 (37%)
>15 | 5/6 (83%) | 1/6 (17%)

^(a)Value of transgelin/CFB gene expression level ratio
^(b)in months
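As an illustration of the chi-square test for independence applied to the contingency counts in Table 5, the following minimal sketch computes the Pearson statistic without any continuity correction. The exact options used in the original analysis are not stated, so this is an approximation of that procedure, not a reproduction of it; the function names are hypothetical. The closed-form tail probability is valid only for 3 degrees of freedom, which is what a 4 × 2 table gives. With these counts the sketch yields a P value of roughly 0.007, in line with the value reported in the text, although several expected counts are small and the chi-square approximation is correspondingly rough.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def chi2_sf_3dof(x):
    """Upper-tail probability of a chi-square variable with exactly 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


# Counts from Table 5: rows are survival bins (<5, 5-10, 10-15, >15 months);
# columns are samples with T/C ratio <1 and samples with T/C ratio >1
table5 = [[0, 7], [8, 3], [5, 3], [5, 1]]
stat, dof = chi_square_independence(table5)
p_value = chi2_sf_3dof(stat)
print(round(stat, 2), dof, round(p_value, 3))
```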

As mentioned previously, prognostic markers were originally selected by examining gene expression level differences between samples in subclasses with the greatest difference in median patient survival (Subclass 1 and Subclass 2, see Table 3). It also was found that these genes could distinguish tumors of the epithelial subtype from all others. Then, using the larger cohort of samples, we examined whether there was evidence that the transgelin/CFB expression ratio provided a valuable diagnostic tool in addition to a predictor of prognosis. We found that the transgelin/CFB expression ratio identified the histological subtypes of tumors with a high degree of accuracy. All epithelial subtype tumors (16/16, 100%) had ratio values <1 and nearly all mixed/sarcomatoid subtype tumors (14/16, 88%) had ratio values >1. The 2 non-epithelial subtype samples that were incorrectly diagnosed in this case originated from patients with atypically long survival (12 and 13 months), much longer than the median survival (6 months) of all non-epithelial subtype samples, thereby reflecting the original intent of the ratio (i.e. prediction of prognosis).

Following the use of appropriate filtering techniques (e.g., Hsiao, 2001), expression level ratios were found to be extremely robust in differentiating the epithelial subtype of MPM using raw data obtained from Affymetrix arrays with probe sets representing 6,800 genes (n=11) and from Affymetrix arrays with probe sets representing 12,000 genes hybridized and scanned by another laboratory (n=13). In these 24 samples, the transgelin/CFB ratio correctly predicted histological subtype in 18 with 2 errors and 4 marginal calls. (The marginal calls were conservatively made when the ratio value was between 0.5 and 2.)
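The ratio-based call with a conservative marginal zone can be sketched as follows. The cutoff of 1 and the marginal band of 0.5 to 2 come from the text; treating the band limits as inclusive, the function name, and the expression values are assumptions made for illustration.

```python
def call_subtype(transgelin, cfb, low=0.5, high=2.0):
    """Return the transgelin/CFB ratio and a histological subtype call.

    Ratios inside [low, high] are reported as marginal rather than forced
    into a class; inclusive limits are an assumption.
    """
    if cfb <= 0:
        raise ValueError("CFB expression must be positive to form a ratio")
    ratio = transgelin / cfb
    if low <= ratio <= high:
        return ratio, "marginal"
    if ratio < 1.0:
        return ratio, "epithelial"
    return ratio, "non-epithelial"


# Hypothetical expression levels (transgelin, CFB) for three samples
for transgelin, cfb in [(200.0, 2000.0), (3000.0, 150.0), (900.0, 1000.0)]:
    ratio, call = call_subtype(transgelin, cfb)
    print(round(ratio, 2), call)
```

Because the call depends only on the ratio of two genes measured in the same sample, it does not require scaling against a reference panel of other samples.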

The genes used to create expression ratios (e.g., transgelin and CFB) are not random predictors of diagnosis/prognosis, but have notable biological relevance to carcinogenesis. CFB is significantly overexpressed in epithelial tumors while transgelin is significantly underexpressed in the same samples. Transgelin binds to native actin filament bundles and gels actin in vitro (Shapland, 1993) and has been proposed as a marker of neoplastic transformation (Lawson, 1997). CFB has been implicated in tumor apoptosis in a manner independent of TNF/TNFR or FasL/Fas interactions (Uwai, 2000). Although not determined in this study to have a functional role in MPM carcinogenesis, CFB's pro-apoptotic function is consistent with the observations showing high levels of this gene significantly correlated with relatively good prognosis (i.e. survival).

TABLE 6

Accession No.: Description
M81757: H. sapiens S19 ribosomal protein mRNA, complete cds
U14969: Human ribosomal protein L28 mRNA, complete cds
M62895: Human lipocortin (LIP) 2 pseudogene mRNA, complete cds-like region
AL022097: Homo sapiens DNA sequence from PAC 256G22 on chromosome 6p24.1-25.3.
Z28407: H. sapiens mRNA for ribosomal protein L8
X64707: H. sapiens BBC1 mRNA
U14971: Human ribosomal protein S9 mRNA, complete cds
X17206: Human mRNA for LLRep3
M17885: Human acidic ribosomal phosphoprotein P0 mRNA, complete cds
AL031228: dJ1033B10.4 (40S ribosomal protein S18 (RPS18, KE-3
L11566: Homo sapiens ribosomal protein L18 (RPL18) mRNA, complete cds
M17733: Human thymosin beta-4 mRNA, complete cds
U14972: Human ribosomal protein S10 mRNA, complete cds
M64716: Human ribosomal protein S25 mRNA, complete cds
X67247: H. sapiens rpS8 gene for ribosomal protein S8
X16064: Human mRNA for translationally controlled tumor protein
J04755: Human ferritin H processed pseudogene, complete cds
L05095: Homo sapiens ribosomal protein L30 mRNA, complete cds
AL022326: dJ333H23.1.1 (60S Ribosomal Protein L3)
Z48501: H. sapiens mRNA for polyadenylate binding protein II
X69391: H. sapiens mRNA for ribosomal protein L6
X65923: H. sapiens fau mRNA
M17886: Human acidic ribosomal phosphoprotein P1 mRNA, complete cds
M17886: Human acidic ribosomal phosphoprotein P1 mRNA, complete cds
L06499: Homo sapiens ribosomal protein L37a (RPL37A) mRNA, complete cds
X55954: Human mRNA for HL23 ribosomal protein homologue
M13934: Human ribosomal protein S14 gene, complete cds
X63527: H. sapiens mRNA for ribosomal protein L19
U14968: Human ribosomal protein L27a mRNA, complete cds
U14970: Human ribosomal protein S5 mRNA, complete cds
L06498: Homo sapiens ribosomal protein S20 (RPS20) mRNA, complete cds
X53777: Human L23 mRNA for putative ribosomal protein
Z12962: H. sapiens mRNA for homologue to yeast ribosomal protein L41
AB002533: Homo sapiens mRNA for Qip1, complete cds
X80822: H. sapiens mRNA for ORF
L01124: Human ribosomal protein S13 (RPS13) mRNA, complete cds
D23661: Human mRNA for ribosomal protein L37, complete cds
L38941: Homo sapiens ribosomal protein L34 (RPL34) mRNA, complete cds
X95404: H. sapiens mRNA for non-muscle type cofilin
U14966: Human ribosomal protein L5 mRNA, complete cds
X52851: Human cyclophilin gene for cyclophilin (EC 5.2.1.8)
AF037643: Homo sapiens 60S ribosomal protein L12 (RPL12) pseudogene, partial sequence
Z49148: H. sapiens mRNA for ribosomal protein L29
X15940: Human mRNA for ribosomal protein L31
M94314: Homo sapiens ribosomal protein L30 mRNA, complete cds
Z26876: H. sapiens gene for ribosomal protein L38
Z19554: H. sapiens vimentin gene
X04098: Human mRNA for cytoskeletal gamma-actin
M13932: Human ribosomal protein S17 mRNA, complete cds
M13932: Human ribosomal protein S17 mRNA, complete cds
M24194: Human MHC protein homologous to chicken B complex protein mRNA, complete cds
M24194: Human MHC protein homologous to chicken B complex protein mRNA, complete cds
M58458: Human ribosomal protein S4 (RPS4X) isoform mRNA, complete cds
AB021288: Homo sapiens mRNA for beta 2-microglobulin, complete cds
X55715: Human Hums3 mRNA for 40S ribosomal protein s3
AL031670: dJ681N20.2 (ferritin, light polypeptide-like 1)
X56932: H. sapiens mRNA for 23 kD highly basic protein
X67309: H. sapiens gene for ribosomal protein S6
X57958: H. sapiens mRNA for ribosomal protein L7
U09953: Human ribosomal protein L9 mRNA, complete cds
K00558: human alpha-tubulin mRNA, complete cds
X03342: Human mRNA for ribosomal protein L32
M31520: Human ribosomal protein S24 mRNA
X63432: H. sapiens ACTB mRNA for mutant beta-actin (beta-actin)
X06617: Human mRNA for ribosomal protein S11
AB009010: Homo sapiens mRNA for polyubiquitin UbC, complete cds
AB009010: Homo sapiens mRNA for polyubiquitin UbC, complete cds
U37230: Human ribosomal protein L23a mRNA, complete cds
M26252: Human TCB gene encoding cytosolic thyroid hormone-binding protein, complete cds
D23660: Human mRNA for ribosomal protein, complete cds
L20941: Human ferritin heavy chain mRNA, complete cds
M16660: Human 90-kDa heat-shock protein gene, cDNA, complete cds
M22919: Human nonmuscle/smooth muscle alkali myosin light chain gene, complete cds
U34995: Human normal keratinocyte substraction library mRNA, clone H22a, complete sequence
Z23090: H. sapiens mRNA for 28 kDa heat shock protein
J03077: Human co-beta glucosidase (proactivator) mRNA, complete cds
X56009: Human GSA mRNA for alpha subunit of GsGTP binding protein
X04409: Human mRNA for coupling protein G(s) alpha-subunit (alpha-S1)
M14630: Human prothymosin alpha mRNA, complete cds
AB011114: Homo sapiens mRNA for KIAA0542 protein, complete cds
AI201310: qf71b11.x1 Homo sapiens cDNA, 3 end
AI525834: PT1.3_06_D01.r Homo sapiens cDNA, 5 end
AF054187: Homo sapiens alpha NAC mRNA, complete cds
AF054187: Homo sapiens alpha NAC mRNA, complete cds
J04182: Homo sapiens lysosomal membrane glycoprotein-1 (LAMP1) mRNA, complete cds
R87876: yo45h01.r1 Homo sapiens cDNA, 5 end
J03592: Human ADP/ATP translocase mRNA, 3 end
T89651: yd99a05.s1 Homo sapiens cDNA, 3 end
X79234: H. sapiens mRNA for ribosomal protein L11
X13546: Human HMG-17 gene for non-histone chromosomal protein HMG-17
D32129: Human mRNA for HLA class-I (HLA-A26) heavy chain, complete cds (clone cMIY-1)
X57352: Human 1-8U gene from interferon-inducible gene family
U73824: Human p97 mRNA, complete cds
U49869: Human ubiquitin gene, complete cds
AI526078: DU3.2-7.G08.r Homo sapiens cDNA, 5 end
AI557852: P6test.G05.r Homo sapiens cDNA, 5 end
X58965: H. sapiens RNA for nm23-H2 gene
X74929: H. sapiens KRT8 mRNA for keratin 8
W52024: zd13a03.s1 Homo sapiens cDNA, 3 end
AL050224: Homo sapiens mRNA; cDNA DKFZp586L2123 (from clone DKFZp586L2123)
AI541542: libtest16.A02.r Homo sapiens cDNA, 5 end
M33680: Human 26-kDa cell surface protein TAPA-1 mRNA, complete cds
M63573: Human secreted cyclophilin-like protein (SCYLP) mRNA, complete cds
Z11692: H. sapiens mRNA for elongation factor 2
M22806: Human prolyl 4-hydroxylase beta-subunit and disulfide isomerase (P4HB) gene
X62654: H. sapiens gene for Me491/CD63 antigen
X13710: H. sapiens unspliced mRNA for glutathione peroxidase
J00194: human hla-dr antigen alpha-chain mrna & ivs fragments
X58536: Human mRNA for HLA class I locus C heavy chain
U15131: Human p126 (ST5) mRNA, complete cds
L13210: Human Mac-2 binding protein mRNA, complete cds
AI541256: pec1.2-3.F11.r Homo sapiens cDNA, 5 end
J04599: Human hPGI mRNA encoding bone small proteoglycan I (biglycan), complete cds
AA044823: zk72a10.s1 Homo sapiens cDNA, 3 end/clone = IMAGE-488346
J02984: Human insulinoma rig-analog mRNA encoding DNA-binding protein, complete cds
AF095154: Homo sapiens C1q-related factor mRNA, complete cds
L41498: Homo sapiens elongation factor 1-alpha 1 (PTI-1) mRNA, complete cds
X56681: Human junD mRNA
M94046: Human zinc finger protein (MAZ) mRNA
AA977163: oq25a04.s1 Homo sapiens cDNA, 3 end
AA977163: oq25a04.s1 Homo sapiens cDNA, 3 end
M55914 = HUMCMYCQ Human c-myc binding protein (MBP-1) mRNA, complete cds
M64241 = HUMQM Human Wilm's tumor-related protein (QM) mRNA, complete cds
X58965 = HSNM23H2G H. sapiens RNA for nm23-H2 gene
D11139 = HUMTIMP Human gene for tissue inhibitor of metalloproteinases, partial sequence
M55409 = HUMPANCAN Homo sapiens pancreatic tumor-related protein mRNA, partial cds
M84711 = HUMFTE1A Human v-fos transformation effector protein (Fte-1), mRNA complete cds
X56681 = HSJUNDR Human junD mRNA
M26880 = HUMUBI13 Human ubiquitin mRNA, complete cds
X04803 = HSYUBG1 Homo sapiens ubiquitin gene
D78361 = HUMODAZ Human mRNA for ornithine decarboxylase antizyme, ORF 1 and ORF 2
J04617 = HUMEF1A Human elongation factor EF-1-alpha gene, complete cds
J04988 = HUMHSP90B Human 90 kD heat shock protein gene, complete cds
D00017 = HUMLIC Homo sapiens mRNA for lipocortin II, complete cds
J03040 = HUMSPARC Human SPARC/osteonectin mRNA, complete cds
J04164 = HUM927A Human interferon-inducible protein 9-27 mRNA, complete cds
V00567 = HSMGLO Human messenger RNA fragment for the beta-2 microglobulin
D14530 = HUMRSPT Human homolog of yeast ribosomal protein S28, complete cds
Ribosomal Protein S20
M14199 = HUMLAMR Human laminin receptor (2H5 epitope) mRNA, 5 end
M63138 = HUMCATD5 Human cathepsin D (catD) gene, exons 7, 8, and 9
S82297 = S82297 beta 2-microglobulin
V00599 = HSTUB2 Human mRNA fragment encoding beta-tubulin. (from clone D-beta-1)

TABLE 7

Symbol | Description
BF | B-factor, properdin
MSLN | mesothelin
TM4SF1 | transmembrane 4 superfamily member 1
CYC1 | cytochrome c-1
RPL12 | ribosomal protein L12
POLR2L | polymerase (RNA) II (DNA directed) polypeptide L (7.6 kD)
RPL18 | ribosomal protein L18
RPL18A | ribosomal protein L18a
RPS23 | ribosomal protein S23
RPS21 | ribosomal protein S21
RPL27 | ribosomal protein L27
K-ALPHA-1 | tubulin, alpha, ubiquitous
ARHGAP1 | Rho GTPase activating protein 1
TPM1 | tropomyosin 1 (alpha)
APOL | apolipoprotein L
TPM1 | tropomyosin 1 (alpha)
SPARC | secreted protein, acidic, cysteine-rich (osteonectin)
COL1A2 | collagen, type I, alpha 2
FN1 | fibronectin 1
NA | Fibronectin, Alt. Splice 1
FN1 | fibronectin 1
COL5A2 | collagen, type V, alpha 2
COL1A2 | collagen, type I, alpha 2
ACTA2 | actin, alpha2, smooth muscle, aorta
TAGLN | transgelin

TABLE 8

Accession # | Symbol | Description
U38980 | PMS2L11 | postmeiotic segregation increased 2-like 11
J04152 | TACSTD2 | tumor-associated calcium signal transducer 2
AI820718 |  | Homo sapiens cDNA, 5 end
U43203 | TITF1 | thyroid transcription factor 1
AB000714 | CLDN3 | claudin 3
AJ002308 | SYNGR2 | synaptogyrin 2
AB000712 | CLDN4 | claudin 4
AF015128 |  | Homo sapiens IgG heavy chain variable region (Vh26) mRNA
M18728 | CEACAM6 | carcinoembryonic antigen-related cell adhesion molecule 6
D83402 |  | Homo sapiens gene for prostacyclin synthase
J02761 | SFTPB | surfactant, pulmonary-associated protein B
X56667 | CALB2 | calbindin 2, (29 kD, calretinin)
X16662 | ANXA8 | annexin (A8 vascular anticoagulant-beta (VAC-beta))
AB016789 | GFPT2 | glutamine-fructose-6-phosphate transaminase 2
Z93930 | XBP1 | X-box binding protein 1
AI651806 | LOC51232 | cysteine-rich repeat-containing protein S52 precursor
AW024285 |  | Homo sapiens cDNA, 3 end
AI445461 | TM4SF1 | transmembrane 4 superfamily member 1
M93036 | TACSTD1 | tumor-associated calcium signal transducer 1
M21389 | KRT5 | keratin 5

REFERENCES FOR EXAMPLE 2

-   1. Golub, T. R. et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-537 (1999).
-   2. Perou, C. M. et al. Molecular portraits of human breast tumours. Nature 406, 747-752 (2000).
-   3. Hedenfalk, I. et al. Gene expression profiles in hereditary breast cancer. N Engl J Med 344, 539-548 (2001).
-   4. Khan, J. et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 7, 673-679 (2001).
-   5. Quackenbush, J. Computational analysis of microarray data. Nat Rev Genet. 2, 418-427 (2001).
-   6. Corson, J. M. & Renshaw, A. A. Pathology of mesothelioma in Comprehensive Textbook of Thoracic Oncology (eds Aisner, J., Arriagada, R., Green, M. R., Martini, N. & Perry, M. C.) 757-758 (Williams and Wilkins, Baltimore, Md., 1996).
-   7. Virtaneva, K. et al. Expression profiling reveals fundamental biological differences in acute myeloid leukemia with isolated trisomy 8 and normal cytogenetics. Proc Natl Acad Sci USA 98, 1124-1129 (2001).
-   8. Welsh, J. B. et al. Analysis of gene expression profiles in normal and neoplastic ovarian tissue samples identifies candidate molecular markers of epithelial ovarian cancer. Proc Natl Acad Sci USA 98, 1176-1181 (2001).
-   9. Clark, E. A., Golub, T. R., Lander, E. S. & Hynes, R. O. Genomic analysis of metastasis reveals an essential role for RhoC. Nature 406, 532-535 (2000).
-   10. Mountain, C. F. Revisions in the international system for staging lung cancer. Chest 111, 1710-1717 (1997).
-   11. Fodor, S. A. Massively parallel genomics. Science 277, 393-395 (1997).
-   12. Lawson, D., Harrison, M. & Shapland, C. Fibroblast transgelin and smooth muscle SM22a are the same protein, the expression of which is down-regulated in many cell lines. Cell Motil Cytoskeleton 38, 250-257 (1997).
-   13. Shapland, C., Hsuan, J. J., Totty, N. F. & Lawson, D. Purification and properties of transgelin: A transformation and shape change sensitive actin-gelling protein. J Cell Biol 121, 1065-1073 (1993).
-   14. Uwai, M. et al. A new apoptotic pathway for the complement factor B-derived fragment Bb. J Cell Physiol 185, 280-292 (2000).
-   15. Sugarbaker, D. J. et al. Extrapleural pneumonectomy in the multimodality therapy of malignant pleural mesothelioma. Results in 120 consecutive patients. Ann Surg 224, 288-294 (1996).
-   16. Wang, K. et al. Monitoring gene expression profile changes in ovarian carcinomas using cDNA microarrays. Gene 229, 101-108 (1999).
-   17. Warrington, J. A., Nair, A., Mahadevappa, M. & Tsyganskaya, M. Comparison of human adult and fetal expression and identification of 535 housekeeping/maintenance genes. Physiol Genomics 2, 143-147 (2000).
-   18. O'Dell, S. D., Bujac, S. R., Miller, G. J. & Day, I. N. Associations of IGF2 ApaI RFLP and INS VNTR class I allele size with obesity. Eur J Hum Genet 7, 565-576 (1999).
-   19. Venables, W. N. & Ripley, B. D. Modern Applied Statistics with S-Plus, (Springer, N.Y., 1997).
-   20. Harrison's Principles of Internal Medicine, 14/e, (McGraw-Hill Companies, New York, 1998).
-   21. The Chipping Forecast, Nature Genetics, 21(1), 1-60 (1999).
-   22. Gwynne, P., and Page, G., Microarray Analysis: the next revolution in Molecular Biology, Science eMarketplace, Science, Aug. 6 (1999). (sciencemag.org/feature/e-market/benchtop/micro.shi)
-   23. Molecular Cloning: A Laboratory Manual, J. Sambrook, et al., eds., Second Edition, (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y., 1989).
-   24. Current Protocols in Molecular Biology, F. M. Ausubel, et al., eds., (John Wiley & Sons, Inc. New York, 1999).
-   25. Wagner et al., Nature Biotechnol. 14, 840-844 (1996).
-   26. Sainio, K., Saarma, M., Nonclercq, D., Paulin, L., and Sariola, H. Antisense inhibition of low-affinity nerve growth factor receptor in kidney cultures: power and pitfalls. Cell Mol. Neurobiol. 14(5), 439-457 (1994).

Example 3 Prediction of Outcome in Mesothelioma Using Gene Expression Ratios

Introduction

Malignant pleural mesothelioma is an asbestos-related, lethal neoplastic disease of the pleura (median survival between 4 and 12 months) subdivided into three major histological subtypes: epithelial, mixed, and sarcomatoid (1-4). Compared to patients with non-epithelial subtypes, patients with the epithelial subtype show a survival benefit from a variety of treatment strategies, including aggressive multi-modality therapy (5-7). Currently, patients who present to our unit with unilateral mesothelioma without extrapleural invasion undergo complete surgical resection (extrapleural pneumonectomy) followed by chemoradiation. The 5-year survival for those patients with stage I and epithelial histology is 40%. However, there are no predictive factors, prognostic molecular markers, or genetic abnormalities other than histological subtype to preoperatively identify these (or other) long-term survivors. In addition, established methods to predict outcome in mesothelioma based on histological appearance are somewhat subjective, prone to human error, and ineffective for small patient cohorts or, in extreme cases, for individual patients (3,8,9).

Gene expression profiling using microarrays holds promise to improve strategies for tumor classification as well as for prediction of response to therapy and survival in cancer (10-16). Nevertheless, no clear consensus exists regarding which computational tools are optimal for the analysis of large gene expression profiling data sets, particularly when predicting outcome. As a result, microarray-based research has not yet significantly impacted the clinical treatment of disease. Recently, we have shown that simple ratios of gene expression using as few as four to six genes are highly accurate in the diagnosis of cancer, and we hypothesized that this technique would be equally useful in additional clinical applications (17). To explore this further, we used gene expression profiling data (17) of mesothelioma samples from patients with widely divergent survival to create an expression ratio-based test capable of predicting outcome in mesothelioma in a manner independent of the histological subtype of the tumor. We found that a simple test (based on the expression levels of four genes) can (i) predict outcome in mesothelioma with high accuracy, (ii) use relatively inexpensive data acquisition platforms, and (iii) analyze individual patients without reference to additional samples.

Methods

Mesothelioma tumor tissues. Discarded mesothelioma surgical specimens were freshly collected and flash frozen from patients undergoing definitive surgery for mesothelioma at Brigham and Women's Hospital who did not receive pre-operative treatment (6). To train an outcome predictor model in this study, we used previously published microarray data (17) to identify a subset of mesothelioma samples obtained from patients with widely divergent survival (n=17 total). An additional 29 samples (i.e., the test set) were used for quantitative RT-PCR analysis only. Each tumor specimen contained greater than 50% tumor cells. Linked clinical and pathological data were obtained for all patients who contributed tumor specimens and rendered anonymous to protect patient confidentiality. Studies utilizing human tissues were approved by and conducted in accordance with the policies of the Institutional Review Board at Brigham and Women's Hospital.

Real time quantitative RT-PCR. Total RNA (2 μg) isolated from 29 tumors in the test set was reverse-transcribed into cDNA using Taq-Man Reverse Transcription reagents (Applied Biosystems, Foster City, Calif.) and quantified using all recommended controls. Primer sequences (synthesized by Invitrogen Life Technologies) were as follows (forward and reverse):

L6: 5′-TTCCATTCCACAATGTGCTT-3′ (SEQ ID NO: 79) and 5′-GGCCAGTGGAACTACACCTT-3′ (SEQ ID NO: 80);
KIAA0977: 5′-AACCGAAGCCTAACCTGAGA-3′ (SEQ ID NO: 81) and 5′-GTCATTTTGGGAGGAGGTTT-3′ (SEQ ID NO: 82);
GDIA1: 5′-AGAAGCAGTCGTTTGTGCTG-3′ (SEQ ID NO: 83) and 5′-TGTACTTCATGCCGGACACT-3′ (SEQ ID NO: 84); and
CTHBP: 5′-ATCTGAAGTTTGGGGTCGAG-3′ (SEQ ID NO: 85) and 5′-TCTCTCCCAGGACCTTCCTA-3′ (SEQ ID NO: 86).

PCR amplification was performed using an Applied Biosystems 5700 Sequence Detector. No-template (negative) controls containing H₂O substituted for template were run in multiple wells on every reaction plate. An automatically calculated melting point dissociation curve generated after every assay was examined to ensure the presence of a single PCR species and a lack of primer-dimer formation in each well. The Comparative C_(T) method (Applied Biosystems) was used with minor modifications to obtain quantitative values for gene expression ratios in all samples. Calculation of an expression ratio using data from two genes in any single sample negates the need for a calibrator sample and a reference gene to standardize for different starting template amounts. Therefore, to form expression ratios of two genes, we merely stated the expression level of one gene relative to the other. In this case, the ΔΔC_(T) value in the Comparative C_(T) equation reduces to: [C_(T(gene 1))−C_(T(gene 2))].
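The reduction described above can be illustrated with a short sketch. It assumes, as the standard Comparative C_(T) calculation does, that the amount of product doubles with each PCR cycle, so that the ratio of gene 1 to gene 2 is 2 raised to the power −[C_(T(gene 1))−C_(T(gene 2))]. The function name, the `efficiency` parameter, and the C_(T) values are hypothetical and are not taken from the study; the "minor modifications" mentioned above are not specified and are not modeled.

```python
def expression_ratio_from_ct(ct_gene1: float, ct_gene2: float,
                             efficiency: float = 2.0) -> float:
    """Expression of gene 1 relative to gene 2 within a single sample.

    Assumption: both primer pairs amplify with the same per-cycle
    efficiency (2.0 means perfect doubling each cycle).
    """
    delta_ct = ct_gene1 - ct_gene2      # the reduced delta-delta-C_T term
    return efficiency ** (-delta_ct)    # lower C_T means higher expression


# Hypothetical C_T values: gene 1 crosses threshold 2 cycles earlier,
# so it is estimated to be 4-fold more abundant than gene 2.
ratio = expression_ratio_from_ct(22.0, 24.0)
print(ratio)  # 4.0
```

Because both C_(T) values come from the same sample, differences in starting template amount cancel, which is why no calibrator sample or reference gene appears in the calculation.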

Data and statistical analysis. A two-sided Student's (parametric) t-test was used for pair-wise comparisons of average gene expression levels among multiple groups and the Significance Analysis of Microarrays (SAM) algorithm (18) was used to estimate the false discovery rate. Kaplan-Meier curves were used to estimate survival in each group. The log-rank test was used to statistically assess differences among multiple survival curves. A Cox proportional-hazards regression model was used for multivariate analysis. The “leave-one-out” method of cross validation (16,19,20) was used to assess internal consistency of the predictor model and analyzed using Fisher's exact test (i.e. 2×2 contingency table). All differences were determined to be statistically significant if P<0.05. Data from three highly accurate gene expression ratios were combined by calculating the geometric mean, (R₁R₂R₃)^(1/3), where R_(i) represents a single ratio value. This is mathematically equivalent to averaging [log₂(R₁), log₂(R₂), log₂(R₃)] and converting the result back from the log scale, thereby giving equal weight to ratio fold-changes of identical magnitude but opposite direction. All calculations and statistical comparisons were generated using S-PLUS (21).
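The geometric mean calculation can be sketched as follows. This is a minimal illustration in Python rather than the S-PLUS code used in the study; the function name and the example ratio values are hypothetical.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive ratios, computed on the log2 scale."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty list of positive values")
    mean_log2 = sum(math.log2(r) for r in ratios) / len(ratios)
    return 2.0 ** mean_log2


# A 4-fold increase and a 4-fold decrease carry equal weight and cancel.
print(geometric_mean([4.0, 0.25, 1.0]))  # 1.0
```

An arithmetic mean of the same three values would be 1.75, which would overstate the 4-fold increase relative to the 4-fold decrease.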

Results

Identification of prognostic molecular markers in mesothelioma. We have previously identified for study a representative cohort of 31 mesothelioma tumors obtained at pneumonectomy (17). The estimated median patient survival (11 months, FIG. 6A) and histological distribution of this group mirror those of mesothelioma patients in our practice (6). The histological subtype of the tumor was not predictive of outcome for these samples (P=0.129, log-rank test, FIG. 6B), even though the estimated median survival of epithelial subtype samples (17 months) was longer than that for non-epithelial subtype samples (3.5 months). To identify genes that are discriminatory between tumors from patients with widely divergent survival and to create an expression ratio-based predictor model, we utilized microarray data (17) for mesothelioma samples that originated from patients whose survival was within the 25^(th) percentile of both disease-related survival extremes irrespective of tumor histological subtype (i.e., the training set, n=17, Table 9A). We formed two groups using these samples: relatively good outcome (survival ≥17 months, n=8) and relatively poor outcome (survival ≤6 months, n=9). The most accurate model developed in the training set was subsequently tested in an independent cohort of samples (i.e. the test set, n=29, Table 9B). We searched all of the genes represented on the microarray for those with a statistically significant ≥2-fold difference in average expression levels between good outcome and poor outcome tumors in the training set of samples. To minimize the effects of background noise, the list of distinguishing genes was further refined by requiring that the mean expression level be >500 in at least one of the two sample sets. We identified a total of 46 prognostic genes in this analysis with an estimated false discovery rate of 10%-20%. The 10 genes with the lowest P values overexpressed in each group are listed in Table 10.
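The three selection criteria described above (statistical significance, a ≥2-fold difference in average expression, and a mean expression level >500 in at least one group) can be sketched as a simple filter. This is an illustration only: the function name and the example expression values are hypothetical, and the P value is assumed to have been computed separately (for example, by the two-sided t-test described in Methods).

```python
from statistics import mean


def is_prognostic_candidate(good, poor, p_value,
                            alpha=0.05, min_fold=2.0, min_level=500.0):
    """Apply the three gene-selection criteria to one gene.

    good, poor: expression values in the good and poor outcome groups.
    p_value: precomputed P value for the difference in group means.
    """
    mean_good, mean_poor = mean(good), mean(poor)
    if p_value >= alpha:
        return False                       # not statistically significant
    if max(mean_good, mean_poor) <= min_level:
        return False                       # both groups near background
    if min(mean_good, mean_poor) <= 0:
        return False                       # fold change is undefined
    fold = max(mean_good, mean_poor) / min(mean_good, mean_poor)
    return fold >= min_fold


# Hypothetical gene: about 2.4-fold higher in good outcome tumors.
print(is_prognostic_candidate([1200, 1000, 1100], [400, 500, 450], 0.004))  # True
```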

TABLE 9A Clinical characteristics of MPM tumors, Training Set

Training Set Sample   Age (years)   Sex   Histology^(a)   BWH Stage   Survival (months)   Status^(b)
72                    46            m     mixed           2           53                  3
74                    40            f     ept             1           51                  2
90                    48            m     ept             2           28                  2
2                     44            f     ept             2           26                  2
68                    61            m     ept             2           21                  3
33                    60            f     ept             2           20                  3
109                   62            m     ept             2           19                  3
76                    67            m     ept             1           17                  3
130                   55            m     mixed           2           6                   3
166                   66            m     sarc            2           6                   3
67                    49            f     ept             2           6                   3
229                   33            f     ept             2           5                   3
6                     39            m     ept             2           5                   3
89                    55            m     mixed           2           3                   3
133                   69            m     mixed           2           2                   3
114                   51            m     mixed           2           2                   3
159                   62            m     sarc            2           2                   3

^(a)ept., epithelial; sarc., sarcomatoid
^(b)1, alive without disease; 2, alive with disease; 3, dead with disease; 4, dead other causes; U, unknown

TABLE 9B Clinical characteristics of MPM tumors, Test Set

Test Set Sample   Age (years)   Sex   Histology^(a)   BWH Stage   Survival (months)   Status^(b)
169               46            m     ept             2           7                   3
146               67            m     ept             2           7                   3
219               39            m     ept             2           6                   1
104               40            m     ept             2           5                   3
110               64            m     ept             2           5                   3
112               31            m     ept             2           55                  3
165               51            m     ept             2           27                  2
5                 51            m     ept             2           8                   3
148               51            m     ept             2           17                  3
96                40            m     ept             2           1                   3
134               56            m     ept             2           1                   4
216               43            f     ept             2           8                   1
208               63            f     ept             2           7                   1
224               68            f     ept             2           6                   1
225               35            f     ept             2           42                  2
163               68            f     ept             2           25                  1
235               46            m     mixed           2           24                  3
206               45            m     mixed           2           45                  2
107               69            m     mixed           2           16                  3
302               55            m     mixed           2           13                  3
161               59            m     mixed           2           12                  3
220               71            m     mixed           2           12                  3
217               57            m     mixed           1           5                   1
150               58            m     mixed           2           3.6                 3
44                57            m     mixed           2           2                   4
222               57            m     mixed           2           1                   U
154               56            f     mixed           2           9                   3
70                57            m     sarc.           2           8                   3
228               73            m     sarc.           2           4                   3

^(a)ept., epithelial; sarc., sarcomatoid
^(b)1, alive without disease; 2, alive with disease; 3, dead with disease; 4, dead other causes; U, unknown

Prediction of outcome using gene expression ratios. We chose the four genes most significantly overexpressed in each group (Table 10) to determine whether expression ratios could accurately classify the 17 samples used to train the model. We calculated a total of 16 possible expression ratios per sample by dividing the expression value of each of the 4 genes (i.e., SBP, KIAA0977 protein, L6 EST, LAR) expressed at relatively higher levels in good outcome samples by the expression value of each of the 4 genes (i.e., CTHBP, calgizzarin, IGFBP-3, GDIA1) expressed at relatively higher levels in poor outcome samples. Samples with ratio values >1 were predicted to be “good outcome” and those with ratio values <1 were predicted to be “poor outcome”. The five most accurate ratios individually identified 88% (15/17) of the samples used to train the model. To incorporate the predictive accuracy of multiple ratios, we calculated the geometric mean (see Methods) for all possible 3-ratio combinations (formed using these 5 ratios) and found that we could identify training samples with accuracy that met or exceeded that of any of the gene pair ratios when used alone (average=94%, range 88%-100%). For further analysis, we chose one of the two 3-ratio combinations that correctly classified 100% (17/17) of the training samples. A total of 4 genes were used in this 3-ratio test: KIAA0977/GDIA1, L6/CTHBP, and L6/GDIA1.
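The enumeration of the 16 gene-pair ratios and the scoring of each ratio against known outcomes can be sketched as follows. The gene labels come from the text above, but the expression values, the function names, and the two example samples are hypothetical. The text does not say how a ratio of exactly 1 is handled; this sketch assigns it to “poor” as an assumption.

```python
from itertools import product

GOOD_GENES = ["SBP", "KIAA0977", "L6", "LAR"]            # higher in good outcome
POOR_GENES = ["CTHBP", "calgizzarin", "IGFBP-3", "GDIA1"]  # higher in poor outcome


def all_ratios(sample):
    """Map (numerator, denominator) to its ratio for the 4 x 4 = 16 pairs."""
    return {(g, p): sample[g] / sample[p]
            for g, p in product(GOOD_GENES, POOR_GENES)}


def predict(ratio):
    """Ratio > 1 predicts good outcome; ties go to 'poor' (assumption)."""
    return "good" if ratio > 1 else "poor"


def ratio_accuracy(samples, labels, pair):
    """Fraction of samples that a single gene-pair ratio classifies correctly."""
    hits = sum(predict(all_ratios(s)[pair]) == y
               for s, y in zip(samples, labels))
    return hits / len(samples)


# Two hypothetical samples with made-up expression values.
samples = [
    {"SBP": 900, "KIAA0977": 800, "L6": 1200, "LAR": 700,
     "CTHBP": 300, "calgizzarin": 400, "IGFBP-3": 500, "GDIA1": 350},
    {"SBP": 250, "KIAA0977": 300, "L6": 200, "LAR": 350,
     "CTHBP": 900, "calgizzarin": 800, "IGFBP-3": 1000, "GDIA1": 700},
]
labels = ["good", "poor"]

print(len(all_ratios(samples[0])))                        # 16
print(ratio_accuracy(samples, labels, ("L6", "CTHBP")))   # 1.0
```

Ranking the 16 pairs by `ratio_accuracy` and then combining the best ones through a geometric mean would mirror the selection of the 3-ratio test described above.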

TABLE 10 Mesothelioma prognostic genes

Accession #   P value   Ratio^(a)   Description

Expressed at relatively higher levels in good outcome tumors
U29091        0.0033    2.8         selenium-binding protein (SBP)
AB023194      0.0065    2.1         KIAA0977 protein
AI445461      0.0073    3.0         EST (similar to L6 tumor antigen)
Y00815        0.0077    2.0         leukocyte antigen related protein (LAR)
D84424        0.0094    6.0         hyaluronan synthase
Y00318        0.0103    3.6         complement control protein factor I
AL049963      0.0103    3.7         EST
AJ223352      0.0142    3.5         histone H2B
AB000220      0.0181    2.3         semaphorin E
L39945        0.0182    2.5         cytochrome b5 (CYB5)
M90657        0.0256    2.8         L6 tumor antigen
AB002301      0.0257    2.1         KIAA0303 protein

Expressed at relatively higher levels in poor outcome tumors
M26252        0.0013    0.38        cytosolic thyroid hormone-binding protein (CTHBP)
D38583        0.0041    0.43        calgizzarin
*M35878       0.0046    0.35        insulin-like growth factor-binding protein-3 (IGFBP-3)
X69550        0.0063    0.47        GDP-dissociation Inhibitor 1 (GDIA1)
M95787        0.0068    0.33        22 kDa smooth muscle protein (SM22), AKA transgelin
AB023208      0.0069    0.49        KIAA0991 protein
X95735        0.0105    0.43        zyxin
AA976838      0.0131    0.40        EST
U90878        0.0132    0.49        carboxyl terminal LIM domain protein (CLIM1)
*M35878       0.0135    0.30        insulin-like growth factor-binding protein-3 (IGFBP-3)
U53204        0.0169    0.39        plectin (PLEC1)
M95178        0.0215    0.40        non-muscle alpha-actinin

^(a)average expression level in good outcome samples/average expression level in poor outcome samples
*IGFBP-3 is listed twice in the lower portion of the table because this gene is represented by multiple Affymetrix probe sets.

Verification of microarray data. Next, we utilized quantitative RT-PCR to verify gene expression levels measured in microarray-based analysis. We randomly chose 3 samples from each group: the good outcome group (samples 74, 33, and 68) and the poor outcome group (samples 89, 229, and 67). Using RT-PCR, we determined the relative expression level of all 4 prognostic genes (L6, GDIA1, CTHBP, and KIAA0977 protein) in these 6 samples. Then, we calculated the 3 individual ratios previously used to predict outcome: KIAA0977/GDIA1, L6/CTHBP, and L6/GDIA1. Finally, we calculated the geometric mean of these 3 ratios and compared the magnitude and direction (i.e. >1 or <1) of this number to that obtained using microarray analysis. We found that the classifications made using the 3-ratio geometric means calculated with data from both platforms were in perfect agreement for all 6 samples (FIG. 6C).

Validation of the model. We utilized a “leave-one-out” cross validation technique (16,19,20) to assess the internal variation of a 3-ratio predictor model. For this analysis, we analyzed 17 different training sets by withholding 1 of the 17 samples to construct a new expression ratio-based classifier exactly as before and then predicting the class (either good or poor outcome) of the withheld sample. This process was repeated sequentially for the remaining 16 samples. We found that 88% (15/17) of the samples were correctly identified in this analysis (P=0.0034, Fisher's exact test).
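The leave-one-out procedure can be sketched generically as below. The `leave_one_out` loop follows the description above; the `train` function is only a toy stand-in (it picks the single gene-pair ratio with the best training accuracy) and does not reproduce the full gene selection and 3-ratio model building used in the study. Gene names `g1` and `g2` and all expression values are hypothetical.

```python
from itertools import permutations


def classify(pair, sample):
    """Predict outcome from one ratio: numerator gene over denominator gene."""
    num, den = pair
    return "good" if sample[num] / sample[den] > 1 else "poor"


def train(samples, labels):
    """Toy model builder: choose the gene pair with best training accuracy."""
    genes = sorted(samples[0])

    def hits(pair):
        return sum(classify(pair, s) == y for s, y in zip(samples, labels))

    return max(permutations(genes, 2), key=hits)


def leave_one_out(samples, labels, train_fn, predict_fn):
    """Withhold each sample in turn, retrain, and predict the withheld one."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train_fn(train_x, train_y)   # rebuilt without sample i
        if predict_fn(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Hypothetical data: g1 is higher in good outcome, g2 in poor outcome.
samples = [{"g1": 800, "g2": 200}, {"g1": 900, "g2": 300},
           {"g1": 150, "g2": 700}, {"g1": 100, "g2": 600}]
labels = ["good", "good", "poor", "poor"]

print(leave_one_out(samples, labels, train, classify))  # 1.0
```

The key point is that the model is rebuilt from scratch inside the loop, so the withheld sample never influences the classifier that predicts it.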

Verification of expression level ratios as outcome predictors. Finally, we tested the ability of expression ratios to predict outcome in a new cohort of mesothelioma tumor samples not subjected to microarray analysis (n=29, the test set, Table 9B). The histological distribution and the estimated median patient survival (12 months, FIG. 7A) of the test set of samples were also representative of those of mesothelioma patients in our practice (6). As before, we found that histological subtype was not strongly predictive of survival in the new cohort of samples (P=0.345, log-rank test, FIG. 7B). We used quantitative RT-PCR to determine relative expression levels for the 4 predictor genes and calculated the geometric mean of 3 prognostic ratios: KIAA0977/GDIA1, L6/CTHBP, and L6/GDIA1. As before, samples with geometric means >1 and <1 were assigned to good outcome and poor outcome groups, respectively. A total of 11 samples were assigned to the good outcome group and 18 to the poor outcome group. The number of test set samples “correctly” classified was estimated using the median survival (12 months) of the entire cohort as a cut-off to form 2 groups: relatively good outcome (>12 month survival) and relatively poor outcome (≤12 month survival). Only those 17 samples from patients who died from disease were considered (status 3, Table 9B). We found that the same proportion of test set samples was classified correctly in this analysis (88%, 15/17) as in the leave-one-out analysis of the training set. To include all samples in the assessment of the model, we performed Kaplan-Meier survival analysis using expression ratio predictions made for the test set of samples. The estimated median survival for the good outcome group (36 months) was over 5-fold higher than that for the poor outcome group (7 months). In addition, we found that the 3-ratio geometric mean model significantly (P=0.0035, log-rank test, FIG. 7C) predicted outcome in the new set of samples. Since it has been demonstrated in very large sample cohorts that patients with epithelial histology generally enjoy significantly longer disease-free survival than patients with non-epithelial histology (22), we used multivariate analysis to examine whether our results using expression ratios were independent of the histological subtype of the tumor. By fitting a Cox proportional-hazards regression model, we found that the (3-ratio) geometric mean value significantly predicts outcome (P=0.0094, hazard ratio=2.6) independent of the histological subtype of the tumor (P=0.75, hazard ratio=0.32). Expression ratios correctly predicted outcome independently of the histological subtype of the tumor in the new set of samples, indicating the ratio method is a better prognostic tool.
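The Kaplan-Meier estimate of median survival used above can be sketched in a few lines. This is a minimal product-limit estimator written for illustration, not the S-PLUS routine used in the study, and it does not implement the log-rank test or the Cox model. The function names and the survival times are hypothetical; an event flag of `False` marks a censored patient (for example, one still alive at last follow-up).

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times: follow-up time for each patient.
    events: True if death was observed, False if the patient was censored.
    Returns [(time, survival_probability)] at each distinct death time.
    """
    order = sorted(zip(times, events))
    at_risk = len(order)
    surv = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = order[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(order) and order[i][0] == t:
            deaths += 1 if order[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving   # deaths and censored patients leave the risk set
    return curve


def median_survival(curve):
    """First time at which estimated survival falls to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached


# Hypothetical group of 5 patients; the patient at 12 months is censored.
curve = kaplan_meier([3, 5, 7, 12, 20], [True, True, True, False, True])
print(median_survival(curve))  # 7
```

Censored patients contribute to the number at risk until their last follow-up, which is how the analysis can include all 29 test set samples rather than only the 17 patients who died from disease.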

Discussion

Current methods of prognosis in mesothelioma include stage and histology at the time of surgery. However, these techniques are not completely reliable and accurate staging usually requires extensive surgery (3,8,9). Recently, we discovered that simple ratios of gene expression can be used to accurately diagnose cancer (17) while successfully avoiding many of the shortcomings which preclude the use of other microarray analytical techniques in wider clinical applications (10,20). In this study, we describe a technique that uses expression data from four genes to independently predict outcome in mesothelioma patients who undergo extrapleural pneumonectomy followed by standard chemoradiation therapy. Although this analysis only utilized four genes, the expression ratio technique can easily incorporate larger numbers of genes when required for acceptable accuracy. To our knowledge, this is the first study in human cancer to use expression profiling techniques to identify treatment-related prognostic markers for use in the development of an outcome predictor model, and to validate the model in an independent cohort using a simpler data acquisition platform such as RT-PCR. Other investigators have tested outcome predictor models in independent samples (16), but studies of this sort continue to be hindered in their clinical applicability through their reliance on relatively large numbers of genes, costly data acquisition platforms (i.e. microarrays), the need for sophisticated algorithms/software, and the inability to analyze a sample independently and without reference to other samples.

The prognostic tool described herein could dramatically impact the current clinical treatment of mesothelioma by preoperatively identifying patients not likely to respond to conventional treatment modalities, thus sparing them from radical surgery. It is currently our practice to obtain a tissue diagnosis prior to recommending therapy for patients with mesothelioma, but the absence of suitable prognostic molecular markers makes it difficult to assign optimal treatments or investigate new modalities. The results of this work, if confirmed prospectively in a larger patient population, should prove helpful in the development of meaningful clinical trials for patients with mesothelioma. We hypothesize that patients whose tumors are analyzed using gene expression ratios and predicted to have relatively poor outcomes are excellent candidates for neo-adjuvant chemotherapy protocols as they are unlikely to benefit from upfront surgery, whereas patients predicted to have relatively good outcomes are more likely to enjoy long term survival after conventional surgical and adjuvant chemoradiation.

The use of gene expression ratios to predict patient outcome in mesothelioma and other cancers (17) overcomes several major obstacles to the clinical use of microarray data. Unlike other widely accepted supervised learning techniques with similar predictive accuracy (10,16,20), the expression ratio method generates a simple numerical measure that can be used to predict clinical outcome using a single biopsy specimen. Since this non-linear function of gene expression is a unit-less number and does not require data from additional training samples or from additional reference genes, expression levels can be measured using any reliable method including quantitative RT-PCR, cDNA and oligonucleotide microarrays, SAGE, or perhaps ELISAs for encoded proteins. The expression ratio technique can also facilitate examination of microarray data by investigators without direct access to sophisticated analytical tools. Using previously published data, we have created ratio-based tests using small numbers of genes that can diagnose localized prostate cancer and predict clinical outcome in breast cancer (see Example 4).

We believe that attempts to bridge the gap between expression profiling studies in cancer and meaningful clinical applications should follow the general spirit of the Occam's Razor principle: “among a set of otherwise equal models, choose the simplest”. Although other microarray-based predictor models in cancer may utilize relatively small numbers of genes to accurately predict outcome (16,19,20), these approaches continue to be limited in their clinical applicability. Furthermore, it has yet to be determined if these approaches can utilize relatively low-cost and widely available data acquisition platforms such as RT-PCR and retain significant survival predictions. The expression ratio technique is fundamentally similar to other widely accepted bioinformatics techniques (10) in that it utilizes genes with inversely correlated expression levels in multiple groups. The principal advantages of using expression ratios to predict clinical parameters are their relative simplicity, platform independence for data acquisition, and requirement for only small quantities of fresh or frozen tissue for analysis. In addition, these tests are relatively low cost and can be used to analyze samples independent of a training set. For these reasons, it is likely that the expression ratio technique will find additional uses in the clinical management of other cancers and diseases.

REFERENCES FOR EXAMPLE 3

-   1. Pass H. Malignant pleural mesothelioma: Surgical roles and novel therapies. Clin Lung Cancer 2001; 3:102-117.
-   2. Aisner J. Diagnosis, staging, and natural history of pleural mesothelioma. In: Aisner J, Arriagada R, Green M R, et al, eds. Comprehensive Textbook of Thoracic Oncology. Baltimore (Md.): Williams and Wilkins; 1996.799-785.
-   3. Ong S-T, Vogelzang N J. Current therapeutic approaches to unresectable (primary and recurrent) disease. In: Aisner J, Arriagada R, Green M R, et al, eds. Comprehensive Textbook of Thoracic Oncology. Baltimore (Md.): Williams and Wilkins; 1996.799-814.
-   4. Peto J, Hodgson J T, Matthews F E, Jones J R. Continuing increase in mesothelioma mortality in Britain. Lancet 1995; 345:535-539.
-   5. Sugarbaker D J, Flores R M, Jaklitsch M T, Richards W G, Strauss G M, Corson J M, et al. Resection margins, extrapleural nodal status, and cell type determine postoperative long-term survival in trimodality therapy of malignant pleural mesothelioma: results in 183 patients. J Thorac Cardiovasc Surg 1999; 117:54-65.
-   6. Sugarbaker D J, Garcia J P, Richards W G, Harpole D H, Jr., Healy-Baldini E, DeCamp M M, Jr., et al. Extrapleural pneumonectomy in the multimodality therapy of malignant pleural mesothelioma. Results in 120 consecutive patients. Ann Surg 1996; 224:288-294.
-   7. Sugarbaker D, Strauss G M, Lynch T J, Richards W, Mentzer S J, Lee T H, et al. Node status has prognostic significance in the multimodality therapy of diffuse, malignant mesothelioma. J Clin Oncol 1993; 11:1172-1178.
-   8. Corson J M, Renshaw A A. Pathology of mesothelioma. In: Aisner J, Arriagada R, Green M R, et al, eds. Comprehensive Textbook of Thoracic Oncology. Baltimore (Md.): Williams and Wilkins; 1996.757-758.
-   9. Ordonez N G. The value of antibodies 44-36A, SM3, HBME-1, and thrombomodulin in differentiating epithelial pleural mesothelioma from lung adenocarcinoma. Am J Surg Pathol 1997; 21:1399-1408.
-   10. Golub T R, Slonim D K, Tamayo P, Huard C, Gaasenbeek M, Mesirov J P, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999; 286:531-537.
-   11. Perou C M, Sorlie T, Eisen M B, van de Rijn M, Jeffrey S S, Rees C A, et al. Molecular portraits of human breast tumours. Nature 2000; 406:747-752.
-   12. Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, et al. Gene expression profiles in hereditary breast cancer. N Engl J Med 2001; 344:539-548.
-   13. Khan J, Wei J S, Ringner M, Saal L H, Ladanyi M, Westermann F, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 2001; 7:673-679.
-   14. Welsh J B, Sapinoso L M, Su A I, Kern S G, Wang-Rodriguez J, Moskaluk C A, et al. Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer. Cancer Res 2001; 61:5974-5978.
-   15. Dhanasekaran S M, Barrette T R, Ghosh D, Shah R, Varambally S, Kurachi K, et al. Delineation of prognostic biomarkers in prostate cancer. Nature 2001; 412:822-826.
-   16. van 't Veer L J, Dai H, van de Vijver M J, He Y D, Hart A A M, Mao M, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002; 415:530-536.
-   17. Gordon G J, Jensen R V, Hsiao L-L, Gullans S R, Blumenstock J E, Ramaswami S, et al. Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res 2002; In Press.
-   18. Tusher V G, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 2001; 98:5116-5121.
-   19. Pomeroy S L, Tamayo P, Gaasenbeek M, Sturla L M, Angelo M, McLaughlin M E, et al. Prediction of central nervous system embryonal tumor outcome based on gene expression. Nature 2002; 415:436-442.
-   20. Shipp M A, Ross K A, Tamayo P, Weng A P, Kutok J L, Aguiar R C T, et al. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nat Med 2002; 8:68-74.
-   21. Venables W N, Ripley B D. Modern Applied Statistics with S-Plus. New York (N.Y.): Springer; 1997.
-   22. Sugarbaker D J, Liptay M J. Therapeutic approaches in malignant mesothelioma. In: Aisner J, Arriagada R, Green M R, et al, eds. Comprehensive Textbook of Thoracic Oncology. Baltimore (Md.): Williams and Wilkins; 1996.786-798.

Example 4 Diagnostic and Prognostic Tests in Prostate and Breast Cancer from Expression Profiling Data

Current gene expression profiling-based bioinformatics tools are highly accurate in the diagnosis and prognosis of cancer (1-6). However, the widespread clinical applicability of these techniques is currently limited, owing largely to the lack of a practical method for translating complex profiling analyses to functional clinical tests. To address this issue, we have created a simple yet effective technique with broad and immediate clinical applicability for performing relatively low cost diagnosis and prediction of prognosis in cancer (see Examples above and reference 7). Our method utilizes a supervised comparison of extensive gene profiling data to identify differentially expressed genes between two groups. Carefully chosen genes are then used to calculate simple expression ratios which in turn are used to predict (in a binary numerical manner) the clinical parameter in question. To date, we have demonstrated the applicability of this method in distinguishing mesothelioma from lung adenocarcinoma (see Examples above and reference 7), in identifying patients with favorable prognosis after surgery for mesothelioma, and in predicting patients with favorable outcome after treatment for medulloblastoma (see Examples above and reference 7). In this study, we have tested the accuracy of ratio-based predictions in two separate applications: the diagnosis of prostate cancer and the prediction of clinical outcome in early stage, node-negative resected breast cancer. By using multiple previously published datasets to train and validate our predictor models, we have also directly tested the hypothesis that this gene expression ratio technique is platform independent and can be utilized in widespread fashion by large numbers of clinical and translational investigators.

Methods

Tumor tissues. Ten sets of matched normal adjacent prostate and malignant prostate cancer (20 specimens total) were obtained from the Tumor Bank at Brigham and Women's Hospital. Studies utilizing human tissues were approved by and conducted in accordance with the policies of the Institutional Review Board at Brigham and Women's Hospital.

Expression profiling data. Microarray data for prostate tissues were obtained from two sources. Gene expression data composing the initial “training set” were obtained using a 9,984-element cDNA microarray (12) and consisted of prostate cancer (PCA, n=14) and a group (n=18 total) composed of both normal adjacent prostate (NAP, n=4) and benign prostatic hyperplasia (BPH, n=14) tissues (Supplemental FIG. 8 Data). When there were no data for a given gene due to a technical artifact, we conservatively assumed no change in expression level from the pooled reference mRNA. Gene expression data composing the initial “test set” were obtained using Affymetrix high-density oligonucleotide microarrays with probe sets representing approximately 12,000 genes (13) and consisted of NAP (n=9) and PCA (n=25) tissues. For this dataset, we scaled gene hybridization intensities (i.e. “.cel” files) to a “target intensity” of 100 using Affymetrix Genechip® Software, v.5.0 (Affymetrix, Santa Clara, Calif.). Gene expression data for breast cancer tissues were obtained from a single source using a microarray containing approximately 25,000 genes (6). The “training set” consisted of two groups of samples: those from 44 patients with greater than 5 years disease-free survival (i.e., relatively good outcome) and those from 34 patients with less than 5 years disease-free survival (i.e., relatively poor outcome). The “test set” consisted of 19 additional profiled patient samples.
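The handling of missing cDNA microarray values described above can be sketched as follows. The sketch assumes that each measurement is stored as a ratio of sample signal to pooled reference signal, so that “no change” corresponds to a ratio of 1.0; that representation, the function name, and the example values are assumptions made for illustration.

```python
import math


def fill_missing(ratios, no_change=1.0):
    """Replace missing measurements (None or NaN) with the no-change ratio."""
    filled = []
    for r in ratios:
        missing = r is None or (isinstance(r, float) and math.isnan(r))
        filled.append(no_change if missing else r)
    return filled


# Hypothetical sample-to-reference ratios with two missing spots.
print(fill_missing([2.5, None, 0.4, float("nan")]))  # [2.5, 1.0, 0.4, 1.0]
```

If the data were stored as log ratios instead, the equivalent no-change value would be 0.0.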

Real time quantitative RT-PCR. Quantitative RT-PCR was performed as described in the examples above and in reference (7). Primer sequences were as follows:

HPN: 5′-AATACATCCAGCCTGTGTGC-3′ (SEQ ID NO: 95) and 5′-TGGCCATAGTACTGCGTGTT-3′ (SEQ ID NO: 96);
MEIS2: 5′-TTAGCGCAAGACACAGGACT-3′ (SEQ ID NO: 97) and 5′-CACTCGTCGATTTGACTGGT-3′ (SEQ ID NO: 98);
C7: 5′-TCAAAATGGTGGTTTGGCTA-3′ (SEQ ID NO: 99) and 5′-CCTACGAGGACTCCTTGCTC-3′ (SEQ ID NO: 100); and
FN1: 5′-GCCATGACAATOGTGTGAAC-3′ (SEQ ID NO: 101) and 5′-GCAAATGGCACCGAGATATT-3′ (SEQ ID NO: 102).

Data and statistical analysis. The selection of predictor genes for use in expression ratio-based diagnosis and prognosis was performed essentially as described in the examples above and in reference (7). Basically, a two-sided Student's (parametric) t-test was used for pair-wise comparisons of average gene expression levels among multiple groups to select predictor genes that have highly significant, inversely correlated average expression levels. Data from multiple highly accurate gene expression ratios were combined by calculating the geometric mean, thereby giving equal weight to ratio fold-changes of identical magnitude but opposite direction. The classification accuracy of selected ratios was assessed using Fisher's exact test. Kaplan-Meier time-to-relapse analysis was used to assess disease-free survival. The log-rank test was used to statistically assess differences among multiple survival curves. All differences were determined to be statistically significant if P<0.05. All calculations and statistical comparisons were generated using S-PLUS (14).
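The geometric-mean combination described above can be illustrated with a minimal Python sketch. The function names and the example ratio values are hypothetical and are not taken from the study; the handling of a combined value of exactly 1 is also an assumption, since the text defines calls only for values above or below 1.

```python
import math

def geometric_mean(ratios):
    """Geometric mean computed via logarithms; every ratio must be positive."""
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be a non-empty list of positive numbers")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

def call_sample(ratios, label_above="A", label_below="B"):
    """Combine several ratios and make a binary call around 1.

    Assumption: a combined value of exactly 1 falls into `label_below`.
    """
    return label_above if geometric_mean(ratios) > 1 else label_below

# A 2-fold increase and a 2-fold decrease cancel in the geometric mean,
balanced = geometric_mean([2.0, 0.5])
# whereas the arithmetic mean of the same two ratios is pulled above 1.
arithmetic = (2.0 + 0.5) / 2
```

Working on the log scale is what gives a 2-fold change upward and a 2-fold change downward the same weight, which a plain average of ratios would not do.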

Results and Discussion

Diagnosis of prostate cancer using gene expression ratios. Prostate cancer is exceedingly common among males in the U.S. (8). Unfortunately, widespread serum prostate-specific antigen (PSA) screening has been found to present major drawbacks (9-11). For this reason, patients who are PSA-positive and at a moderate or high risk for prostate cancer undergo a core needle biopsy of the prostate for definitive diagnosis, a procedure associated with substantial patient discomfort. To improve the diagnostic accuracy, decrease the discomfort, and reduce the resulting non-compliance associated with current methodology, we explored the feasibility of designing a less invasive diagnostic test for prostate cancer. To accomplish this, we designed an expression ratio-based test which would utilize RT-PCR for data acquisition, and by virtue of the quantity of RNA needed (e.g., <100 pg), would likely support sample attainment using fine needle aspirations (FNA).

We identified two published reports that provide extensive gene profiling data from prostate cancer and non-malignant prostate tissues (12, 13). We used data from one manuscript to develop our training set and data from the other as our test set (see Methods for details). To create an expression ratio-based diagnostic test capable of distinguishing prostate cancer (PCA) from either normal adjacent prostate or benign prostatic hypertrophy (NAP and BPH, respectively), we first identified a total of 19 known genes with inversely correlated average expression levels in the training set that matched our filtering criteria (P<0.01, at least a 2-fold difference in mean expression levels between PCA and NAP/BPH). We chose 11 of these genes for further analysis since they were also represented on the expression profiling platform of the test set (Table 11).

TABLE 11 Prostate cancer diagnostic genes

| Accession # | P value, training set | P value, test set | LocusLink Symbol | Description |

Expressed at relatively higher levels in NAP/BPH
| AA424743 | 1.5 × 10⁻⁷ | — | BRF1 | butyrate response factor 1 (EGF-response factor 1) |
| AA418773 | 2.3 × 10⁻⁷ | — | HPS | Hermansky-Pudlak syndrome |
| AA148641 | 2.8 × 10⁻⁷ | 1.9 × 10⁻¹⁰ | MEIS2 | Meis (mouse) homolog 2 |
| R98851 | 0.0012 | — | CALLA | common acute lymphoblastic leukemia antigen |
| AA598478 | 0.0036 | 0.015 | C7 | complement component 7 |
| R62612 | 0.0070 | 9.8 × 10⁻⁵ | FN1 | fibronectin 1 |

Expressed at relatively higher levels in PCA
| H50323 | 2.5 × 10⁻⁸ | 7.0 × 10⁻⁵ | FASN | fatty acid synthase |
| H62162 | 1.0 × 10⁻⁶ | 1.4 × 10⁻⁸ | HPN | hepsin |
| AA460115 | 2.8 × 10⁻⁶ | 0.22 | ODC1 | ornithine decarboxylase 1 |
| N26311 | 3.9 × 10⁻⁵ | 2.5 × 10⁻⁴ | PLAB | prostate differentiation factor |
| AA454207 | 3.4 × 10⁻⁴ | — | LABH2 | putative transmembrane protein |

—, these genes were not reliably detected (i.e., average expression levels <600) in at least one group of the test set and were not given further consideration.

Using these 11 genes, we calculated 30 expression ratios per sample by dividing the expression value of each of the 6 genes expressed at relatively higher levels in NAP/BPH by the expression value of each of the 5 genes expressed at relatively higher levels in PCA. Then, we tested the diagnostic accuracy of these ratios in the 28 training set samples obtained from the same study. Samples with ratio values >1 were called NAP/BPH and those with ratio values <1 were called PCA. Not surprisingly, we found that these 30 ratios could be used to correctly distinguish between non-malignant tissues and PCA with a high degree of accuracy (average=86%, range 76%-100%).
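The construction of the 30 ratios can be sketched in Python as follows. The gene symbols come from Table 11, but the expression values are invented for illustration, the function names are hypothetical, and leaving a ratio of exactly 1 uncalled is an assumption.

```python
from itertools import product

def pairwise_ratios(sample, numerator_genes, denominator_genes):
    """Return {(numerator, denominator): ratio} for every gene pair."""
    return {
        (n, d): sample[n] / sample[d]
        for n, d in product(numerator_genes, denominator_genes)
    }

def call_from_ratio(ratio, above="NAP/BPH", below="PCA"):
    """Ratio > 1 gives `above`, ratio < 1 gives `below`; exactly 1 is uncalled."""
    if ratio > 1:
        return above
    if ratio < 1:
        return below
    return None

# Gene symbols from Table 11; the expression values below are invented.
higher_in_nap_bph = ["BRF1", "HPS", "MEIS2", "CALLA", "C7", "FN1"]
higher_in_pca = ["FASN", "HPN", "ODC1", "PLAB", "LABH2"]

hypothetical_sample = {g: 800.0 for g in higher_in_nap_bph}
hypothetical_sample.update({g: 200.0 for g in higher_in_pca})

ratios = pairwise_ratios(hypothetical_sample, higher_in_nap_bph, higher_in_pca)
calls = {pair: call_from_ratio(r) for pair, r in ratios.items()}
```

Each of the 6 × 5 = 30 ratios yields its own call for the sample, so the accuracy of every ratio can be scored separately against the known diagnoses.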

To further refine our diagnostic tool, we examined the expression patterns of the 11 genes identified in the training set in a new cohort of samples (i.e. the test set) for which published data were available from another laboratory. Four genes were discarded because they were not reliably detected in at least one group on the profiling platform of the test set (i.e., average expression level <600 in both NAP and PCA samples). Of the remaining 7 genes, only one (ODC1) was not expressed at significantly different levels in test set samples and was not given further consideration (Table 11). We formed a total of 9 possible ratios from the remaining 6 genes and found that all possessed similarly high accuracy in diagnosing test set samples (average=93%, range 88%-100%). To utilize more than two discriminating genes, we calculated the geometric mean of the 3 most accurate individual ratios and examined the ability of this 3-ratio (4-gene) test, C7/HPN, MEIS2/HPN, and FN1/HPN, to diagnose test set samples. As expected, we found that the accuracy of this 3-ratio test remained high (97%, 33/34). The HPN gene is used in all 3 of the ratios, indirectly corroborating the results of the original analyses of these datasets showing this gene to be highly expressed in PCA (12, 13). Finally, we validated this 3-ratio diagnostic test in an independent set of discarded NAP (n=10) and PCA (n=10) patient specimens using quantitative RT-PCR performed at our institution. We found this technique to be highly accurate (90%, 18/20) in classifying these samples (P=0.0007, Fisher's exact test). In both misclassifications, normal prostate was diagnosed as PCA, but no PCA specimens were diagnosed as non-cancer.
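As a rough check on the RT-PCR validation result, the following Python sketch computes a two-sided Fisher's exact test from first principles. The 2x2 table is inferred from the text (10 NAP and 10 PCA specimens, with two NAP specimens called PCA and no PCA specimens called non-cancer). The two-sided definition used here, summing all tables no more probable than the observed one, is an assumption about how the reported value was obtained, and the function name is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # The small relative tolerance guards against floating-point ties
    return sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= observed * (1 + 1e-9)
    )

# Rows: true NAP, true PCA; columns: called non-cancer, called PCA
p_value = fisher_exact_two_sided(8, 2, 0, 10)
```

Under these assumptions the computed value is approximately 0.0007, in line with the P value reported for the 18/20 classification.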

Despite the fact that diagnostic genes were chosen from a training set in which non-malignant tissues were composed primarily of BPH, the gene ratios remain accurate in distinguishing cancer from the non-malignant samples in the test set, which were exclusively NAP. To see if the inverse is true, we reversed the training and test sets and identified predictor genes in exactly the same manner as above. Of the 4 genes used in the diagnostic test from above, only HPN was listed among the 10 most significant genes overexpressed in either group in the new training set. This finding may be attributed to the larger numbers of genes on this training set profiling platform and/or the fact that NAP and BPH do not have perfectly overlapping expression patterns, but enough similarities in key genes to be mutually distinctive from PCA. To test the discriminating nature of these new predictor genes, we chose the 4 genes most significantly overexpressed in each group (8 genes total) and present on both profiling platforms (LocusLink symbol, P value in new training set): DJ742C19.2, P=10⁻¹³; FHL1, P=4.8×10⁻¹²; SEC23A, P=7.5×10⁻¹¹; ATP2A2, P=10⁻¹⁰; HPN, P=1.3×10⁻⁸; KLK3, P=1.3×10⁻⁶; LU, P=3.7×10⁻⁶; LIM, P=4.0×10⁻⁶. In the new test set, we discovered that 4 of these 8 genes were expressed at significantly different levels: HPN (P=10⁻⁶), SEC23A (P=3.4×10⁻⁴), LIM (P=10⁻⁴), and KLK3 (P=0.049). As before, a total of 16 possible ratios were calculated using these 8 genes and used to diagnose samples in the new test set. The accuracy of these ratios varied greatly (average=75%, range 41%-91%). We combined the four most accurate individual ratios (≥88% accuracy) and found that this 4-ratio test was actually slightly less accurate (84%, 27/32) than any single ratio. No normal samples were misdiagnosed in this test and three of the five errors resulted from BPH samples diagnosed as PCA. These observations combine to suggest that genes found to be discriminatory between BPH and PCA are effective in distinguishing between NAP and PCA, but the reverse is less likely to result in accurate stratification. Another possible explanation is that the platform used in the second set of experiments is not sufficiently extensive to include the best diagnostic genes for this application. Nevertheless, the gene ratio technique was effective in producing relatively accurate and cancer-sensitive diagnostic tests across two platforms and in both directions.

Prediction of prognosis in breast cancer using gene expression ratios. Breast cancer is the most common malignancy in women (in 2001) and is the second highest cause of cancer death in North American women (www.cancer.org). Breast cancer gene expression signatures have recently been used to stratify tumor samples into prognostic groups based on cancer recurrence (6). In this context, tumors were obtained from women who underwent surgical resection for lymph node negative breast cancer. “Good prognosis” was defined as disease-free survival for at least 5 years and “poor prognosis” was defined as the development of distant metastases within a 5 year period. An optimal 70-gene classifier was identified and validated in an independent set of tumors. Although the classifier described in this manuscript appears highly accurate and reproducible, there are several limitations to the rapid incorporation of these results into a clinically relevant test. For one, van 't Veer and colleagues ranked tumors for comparison to classification thresholds by comparing the correlation of predictor genes to the average “good prognosis” profile taken from data acquired on the same platform. This relative measure of contrast relies upon absolute expression levels obtained using microarrays. Unfortunately, it is not certain that an alternative data acquisition platform will produce similarly accurate results. Also, by definition, this technique cannot classify an individual sample without reference to data from additional samples.

We hypothesized that the expression ratio technique could classify samples with similar or greater accuracy to that described by van 't Veer et al. while requiring substantially fewer genes. To test this hypothesis, we identified predictor genes with inversely correlated average expression levels in the same training set as used by van 't Veer et al. and composed of good prognosis samples (n=44) and poor prognosis samples (n=34). We found 8 genes that fit our filtering criteria (P<0.01, at least a 2-fold difference in mean expression levels), 4 genes each overexpressed in good and poor prognosis samples (Table 12). We calculated all 16 possible expression ratios per training set sample by dividing the expression value of each of the 4 genes expressed at relatively higher levels in good prognosis samples by the expression value of each of the 4 genes expressed at relatively higher levels in poor prognosis samples. Samples with ratio values >1 were classified as good prognosis and those with ratio values <1 were classified as poor prognosis. The classification accuracy of these 16 ratios in the training set varied widely (average=70%, range 59%-80%), so we determined the classification accuracy of multiple ratios combined in a single test. Beginning with the three most accurate ratios, we added additional ratios in descending order of accuracy to form a total of three multiple-ratio tests. These tests used 3, 4, and 6 individual ratios and were 85%, 83%, and 84% accurate in the training set, respectively, demonstrating that the combination of multiple ratios in this analysis exceeds the classification accuracy of the single most accurate individual ratio. (The 6-ratio test incorporated two additional equally accurate ratios.) Only one of eight predictor genes (ASAH2) was not used in any of the three multiple-ratio tests.
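The procedure of ranking individual ratios by training accuracy and then combining the best of them can be sketched in Python as below. All ratio values are invented and the function names are hypothetical; the sketch only illustrates how a geometric-mean combination can outperform each of its component ratios, not the actual breast cancer data.

```python
import math

def accuracy(calls, truth):
    """Fraction of samples whose call matches the known class."""
    return sum(c == t for c, t in zip(calls, truth)) / len(truth)

def call(value):
    """Ratio (or combined ratio) > 1 gives 'good', otherwise 'poor'."""
    return "good" if value > 1 else "poor"

def rank_ratios(ratio_values, truth):
    """Sort ratio names by single-ratio training accuracy, best first."""
    scores = {
        name: accuracy([call(v) for v in values], truth)
        for name, values in ratio_values.items()
    }
    return sorted(scores, key=scores.get, reverse=True), scores

def multi_ratio_accuracy(ratio_values, names, truth):
    """Accuracy of the per-sample geometric mean of the named ratios."""
    combined = [
        math.exp(sum(math.log(ratio_values[n][i]) for n in names) / len(names))
        for i in range(len(truth))
    ]
    return accuracy([call(v) for v in combined], truth)

# Invented ratio values for four training samples (not data from the study)
truth = ["good", "good", "poor", "poor"]
ratio_values = {
    "r1": [3.0, 0.8, 0.5, 0.4],  # miscalls the second sample
    "r2": [0.9, 2.0, 0.6, 0.7],  # miscalls the first sample
    "r3": [1.5, 1.2, 1.1, 0.2],  # miscalls the third sample
    "r4": [0.5, 0.6, 2.0, 3.0],  # miscalls every sample
}
ranked, scores = rank_ratios(ratio_values, truth)
top3 = ranked[:3]
combined_accuracy = multi_ratio_accuracy(ratio_values, top3, truth)
```

In this toy case each of the three best ratios is 75% accurate alone, while their geometric mean classifies all four samples correctly, because each ratio's error is outweighed by the other two.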

TABLE 12 Breast cancer prognostic genes

| Gene ID ^(a) | P value | LocusLink Symbol | Description |

Expressed at relatively higher levels in good prognosis
| NM_003862 | 1.7 × 10⁻⁴ | FGF18 | fibroblast growth factor 18 |
| *Contig47178_RC | 3.6 × 10⁻³ | — | EST |
| NM_003147 | 4.9 × 10⁻³ | SSX2 | synovial sarcoma, X breakpoint 2 |
| NM_019893 | 6.5 × 10⁻³ | ASAH2 | N-acylsphingosine amidohydrolase (non-lysosomal ceramidase) 2 |

Expressed at relatively higher levels in poor prognosis
| AL080059 | 2.0 × 10⁻⁶ | KIAA1750 | KIAA1750 protein |
| NM_006681 | 4.0 × 10⁻⁵ | NMU | neuromedin U |
| *Contig29050_RC | 5.1 × 10⁻³ | — | EST |
| NM_000340 | 7.7 × 10⁻³ | SLC2A2 | solute carrier family 2 (facilitated glucose transporter), member 2 |

^(a), sequences in training set expression profiling platform were identified by their GenBank Accession number or EST contig number (6).
—, not available
*, these sequences were not homologous to any known genes at the time of this study (BLAST search, http://www.ncbi.nlm.nih.gov/BLAST/)

We then examined these three multiple-ratio tests in the same test set used by van 't Veer, which consisted of 19 additional samples: 7 good prognosis samples and 12 poor prognosis samples (6). We discovered that all 3 sets of gene ratio tests were able to distinguish test set samples with at least 80% accuracy. The most successful gene ratio combination correctly identified 84% (16/19) of the test set samples (P=0.0055, Fisher's exact test) utilizing 5 genes (from Table 12) in 4 ratios: SSX2/KIAA1750, Contig47178_RC/KIAA1750, FGF18/KIAA1750, and FGF18/NMU. These results are nearly as accurate as those obtained with the optimized 70-gene classifier developed by van 't Veer and colleagues for the same dataset (6). Their classifier correctly identified 17/19 samples but required 65 additional genes.

We performed a final analysis of the gene profiling data obtained from breast cancer tissues to develop a model optimized for sensitivity. As noted by van 't Veer and colleagues (6), it is desirable for therapeutic purposes to minimize the number of poor prognosis samples assigned to the good prognosis category in order to ultimately capture all patients at risk of recurrence for adjuvant systemic therapy. Since all three misclassified samples in our best 4-ratio test were samples obtained from patients with poor prognosis, we analyzed multiple-ratio tests exactly as above, with the exception that individual ratios were ranked according to classification accuracy in the poor prognosis group only. As predicted, these tests accurately classified poor prognosis samples, but we discovered that they also remained relatively accurate overall (poor prognosis accuracy, overall accuracy): 3 ratios (91%, 75%), 5 ratios (88%, 82%), and 6 ratios (91%, 83%). (Two equally accurate ratios were added to the 3-ratio test.) In the test set of samples, the 3- and 5-ratio tests misclassified only 1 of 12 poor prognosis samples, but each resulted in 4 misclassifications overall (79%, 15/19). The 6-ratio test also resulted in accurate identification of 11 of 12 poor prognosis patients and only 3 overall errors (84%, 16/19, P=0.00954, Fisher's exact test) using 6 genes (from Table 12): FGF18/SLC2A2, FGF18/Contig29050_RC, SSX2/SLC2A2, SSX2/Contig29050_RC, FGF18/KIAA1750, and FGF18/NMU. We performed Kaplan-Meier time-to-relapse analysis using predictions made from this test in the 19 test set samples and found a significant difference (P=0.0197, FIG. 8) between groups predicted to have widely divergent disease-free survival times. These results indicate that ratios chosen for enhanced sensitivity perform similarly well in the test set samples without any substantial sacrifice in overall accuracy. There are two individual ratios in common between this 6-ratio test and the best 4-ratio test we used to initially develop a classifier based only on overall accuracy. Although both tests resulted in only 3 misclassifications in the test set (n=19), we found the 6-ratio test to be more sensitive.
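For readers unfamiliar with the Kaplan-Meier time-to-relapse analysis mentioned above, the following Python sketch shows the product-limit estimate for a single group. The follow-up times and event flags are invented, the function name is hypothetical, and the log-rank comparison between predicted groups is not implemented here.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  : follow-up time for each patient
    events : True if relapse was observed at that time, False if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        # All patients sharing this follow-up time (contiguous after sorting)
        same_time = [e for tt, e in data[i:] if tt == t]
        n_events = sum(same_time)
        if n_events:
            # Product-limit step: multiply by the fraction surviving time t
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= len(same_time)
        i += len(same_time)
    return curve

# Invented follow-up data in months (not taken from the study)
times = [6, 12, 12, 20, 30, 45]
events = [True, True, False, True, False, False]
curve = kaplan_meier(times, events)
```

Running the estimator separately on the samples predicted to have good and poor prognosis would give the two curves that a log-rank test then compares.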

It is important to note that we have not proposed an exact protocol for developing and testing ratio-based predictor models. In fact, we discovered in this study and others (7) that multiple combinations of genes, in the form of ratios, can achieve similarly accurate results. We merely assert that simple ratios can be a highly accurate means of predicting clinical parameters using very small numbers of genes and simpler data acquisition platforms, such as quantitative RT-PCR and/or custom microarrays. Furthermore, this strategy can be used to analyze microarrays without the need for additional reference samples. In the case of prostate cancer, we envision that diagnosis using mRNA obtained from fine needle aspirations would be less invasive than current biopsy techniques and would likely increase compliance and reduce discomfort in men whose prostate-specific antigen levels mandate frequent screening. Similarly, women with breast cancer undergoing initial diagnostic biopsy could have tissue saved for a similar gene expression ratio based test using quantitative RT-PCR or a custom microarray. Women found to be at high risk for recurrence may be selected for either neo-adjuvant chemotherapy or post-surgical adjuvant therapy. The gene ratio method thus presents an opportunity to translate initial microarray based gene expression profiling to simple clinical tests that are performed using quantitative RT-PCR, microarrays, or other platforms on material obtained surgically or from fine needle aspirations.

REFERENCES FOR EXAMPLE 4

-   1. Shipp, M. A., Ross, K. A., Tamayo, P., Weng, A. P., Kutok, J. L., Aguiar, R. C. T., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G. S., Ray, T. S., Koval, M. A., Last, K. M., Norton, A., Lister, T. A., Mesirov, J., Neuberg, D. S., Lander, E. S., Aster, J. C., and Golub, T. R. Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning, Nat. Med. 8: 68-74, 2002.
-   2. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science. 286: 531-537, 1999.
-   3. Perou, C. M., Sorlie, T., Eisen, M. B., van de Rijn, M., Jeffrey, S. S., Rees, C. A., Pollack, J. R., Ross, D. T., Johnsen, H., Akslen, L. A., Fluge, O., Pergamenschikov, A., Williams, C., Zhu, S. X., Lonning, P. B., Borresen-Dale, A.-L., Brown, P. O., and Botstein, D. Molecular portraits of human breast tumours, Nature. 406: 747-752, 2000.
-   4. Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Kallioniemi, O.-P., Wilfond, B., Borg, A., and Trent, J. Gene expression profiles in hereditary breast cancer, N. Engl. J. Med. 344: 539-548, 2001.
-   5. Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C., and Meltzer, P. S. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nat. Med. 7: 673-679, 2001.
-   6. van 't Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A. M., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R., and Friend, S. Gene expression profiling predicts clinical outcome of breast cancer, Nature. 415: 530-536, 2002.
-   7. Gordon, G. J., Jensen, R. V., Hsiao, L.-L., Gullans, S. R., Blumenstock, I. E., Ramaswami, S., Richards, W. G., Sugarbaker, D. J., and Bueno, R. Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma, Cancer Res. 62: TBD (September 1 issue), 2002.
-   8. Jemal, A., Thomas, A., Murray, T., and Thun, M. Cancer statistics, 2002, CA Cancer J. Clin. 52: 23-47, 2002.
-   9. Etzioni, R., Penson, D. F., Legler, J. M., Tommaso, D., Boer, R., Gann, P. H., and Feuer, E. J. Overdiagnosis due to prostate-specific antigen screening: Lessons from U.S. prostate cancer incidence trends, J. Natl. Cancer Inst. 94: 981-990, 2002.
-   10. Djavan, B., Zlotta, A., Kratzik, C., Remzi, M., Seitz, C., Schulman, C. C., and Marberger, M. PSA, PSA density, PSA density of transition zone, free/total PSA ratio, and PSA velocity for early detection of prostate cancer in men with serum PSA 2.5 to 4.0 ng/mL, Urology. 54: 517-522, 2001.
-   11. Pannek, J. and Partin, A. W. The role of PSA and percent free PSA for staging and prognosis prediction in clinically localized prostate cancer, Semin. Urol. Oncol. 16: 100-105, 1998.
-   12. Dhanasekaran, S. M., Barrette, T. R., Ghosh, D., Shah, R., Varambally, S., Kurachi, K., Pienta, K. J., Rubin, M. A., and Chinnaiyan, A. M. Delineation of prognostic biomarkers in prostate cancer, Nature. 412: 822-826, 2001.
-   13. Welsh, J. B., Sapinoso, L. M., Su, A. I., Kern, S. G., Wang-Rodriguez, J., Moskaluk, C. A., Frierson, H. F., and Hampton, G. M. Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer, Cancer Res. 61: 5974-5978, 2001.
-   14. Venables, W. N. and Ripley, B. D. Modern Applied Statistics with S-Plus. New York: Springer, 1997.

Example 5 Prediction of Outcomes of Lung Adenocarcinoma Using Expression Profiling Data

This example describes the use of published data relating gene expression profiles and outcome in lung adenocarcinoma. A set of gene ratios was generated by analyzing the data from Beer et al. (Nature Med. 8: 816-824, 2002), who used smaller chips (6800 genes), as a training set.

The training set ratios were tested using the published data set derived from expression profiling experiments using 12,000 genes (Bhattacharjee et al., Proc. Natl. Acad. Sci. USA. 98: 13790-13795, 2001). The object was to predict good outcome versus recurrence in stage I lung cancer after surgery. As shown below, the ratios derived from the training set data (Beer et al.) could differentiate significantly between good and poor outcomes in the test set data (Bhattacharjee et al.).

The analysis in the other direction (using Bhattacharjee et al. expression data as the training set and Beer et al. expression data as the test set) did not work because the best genes in the analysis of the Bhattacharjee et al. expression data were not present among the genes (6800-gene chips) analyzed by Beer et al.

Gene selection criteria: Genes having a >2-fold higher expression in good or poor outcome samples, and the lowest (best) p values, were selected.

Training Set (Beer et al. data); good outcome (n=21) means alive at 5 years; poor outcome (n=11) means disease recurrence within 4 years.

TABLE 13 Genes overexpressed in tumors of different outcome

| Gene # | Overexpressed in . . . | LocusLink Symbol | Accession # | Description |
| 1 | Good | APOE | M12529 | apolipoprotein E |
| 4 | Good | LPIN2 | D87436 | lipin 2 |
| 5 | Poor | SLC2A1 | K03195 | solute carrier family 2 (facilitated glucose transporter), member 1 |
| 6 | Poor | S100P | AA131149 | S100 calcium-binding protein P |
| 7 | Poor | MST1R | X70040 | macrophage stimulating 1 receptor (c-met-related tyrosine kinase) |

Gene ratios were calculated as follows:

(genes overexpressed in good outcome) / (genes overexpressed in poor outcome)

The application of the ratios is shown in Table 14.

TABLE 14 Training set gene ratios for predicting outcome

| Ratio | error good | error poor | total error | % correct |
| 1/5 | 1 | 11 | 12 | 63 |
| 1/6 | 4 | 2 | 6 | 81 |
| 1/7 | 1 | 10 | 11 | 66 |
| 1/8 | 1 | 6 | 7 | 78 |
| 2/5 | 4 | 4 | 8 | 75 |
| 2/6 | 14 | 1 | 15 | 53 |
| 2/7 | 6 | 5 | 11 | 66 |
| 2/8 | 5 | 4 | 9 | 72 |
| 3/5 | 4 | 3 | 7 | 78 |
| 3/6 | 12 | 0 | 12 | 63 |
| 3/7 | 4 | 3 | 7 | 78 |
| 3/8 | 6 | 4 | 10 | 69 |
| 4/5 | 3 | 1 | 4 | 88 |
| 4/6 | 14 | 0 | 14 | 56 |
| 4/7 | 5 | 2 | 7 | 78 |
| 4/8 | 4 | 3 | 7 | 78 |
| 1/6, 4/5, 4/7 | 1 | 2 | 3 | 91 |

Error good = number of errors in predicting good outcome in training set
Error poor = number of errors in predicting poor outcome in training set
Error total = number of total errors in predicting outcome in training set

The top 3 ratios from the training set (1/6, 4/5, 4/7) were chosen according to the following criteria: poor >80% correctly identified and overall >75% correctly identified. The combination of these three ratios resulted in the correct prediction of 20/21 good outcome tumors and 9/11 poor outcome tumors (29/32=91%).
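The selection criteria can be checked against the error counts in Table 14 with a short Python sketch. The error counts and group sizes are taken from the table and the training set description; the function name and data layout are hypothetical.

```python
N_GOOD, N_POOR = 21, 11  # training set sizes given for the Beer et al. data

# (errors in good outcome, errors in poor outcome) per ratio, from Table 14
TABLE_14 = {
    "1/5": (1, 11), "1/6": (4, 2), "1/7": (1, 10), "1/8": (1, 6),
    "2/5": (4, 4), "2/6": (14, 1), "2/7": (6, 5), "2/8": (5, 4),
    "3/5": (4, 3), "3/6": (12, 0), "3/7": (4, 3), "3/8": (6, 4),
    "4/5": (3, 1), "4/6": (14, 0), "4/7": (5, 2), "4/8": (4, 3),
}

def select_ratios(table, min_poor=0.80, min_overall=0.75):
    """Keep ratios exceeding both the poor-outcome and overall thresholds."""
    total = N_GOOD + N_POOR
    selected = []
    for name, (err_good, err_poor) in table.items():
        poor_correct = (N_POOR - err_poor) / N_POOR
        overall_correct = (total - err_good - err_poor) / total
        if poor_correct > min_poor and overall_correct > min_overall:
            selected.append(name)
    return selected

selected = select_ratios(TABLE_14)
```

Applying both thresholds to the 16 single ratios leaves 1/6, 4/5, and 4/7, the same three ratios named in the text.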

The three-ratio combination was applied to the test set data of Bhattacharjee et al. The results are shown in Table 15:

TABLE 15 Application of gene expression ratios to test set

Test Set (Bhattacharjee et al. data); good outcome (n = 28) means alive at 5 years; poor outcome (n = 19) means disease recurrence within 4 years.

| Ratios | error good | error poor | total | % correct | % correct poor |
| 1/6, 4/5, 4/7 | 7 | 8 | 15 | 68 (32/47) | 58 (11/19) |

Error good = number of errors in predicting good outcome in test set
Error poor = number of errors in predicting poor outcome in test set
Error total = number of total errors in predicting outcome in test set

TABLE 16 Predictions and status for individual tumor samples

Stage 1 adenocarcinoma only; excluded patient tissue samples of <40% tumor cell and/or mixed histology.

| Sample | survival | status | Censor | % tumor | Group |
| Adeno31618Good | 60.5 | 1 | 10 | 50 | Good |
| Adeno31633Good | 83 | 1 | 10 | ? | Good |
| Adeno32004Good | 85.9 | 1 | 0 | 30 | Poor |
| Adeno32109Good | 99.1 | 1 | 0 | 80 | Good |
| Adeno32137Good | 98.9 | 1 | 0 | 60 | Poor |
| Adeno32019Good | 66.2 | 1 | 0 | 5 | Poor |
| Adeno32314Good | 62.6 | 1 | 0 | 40 | Good |
| Adeno32027Good | 71.1 | 1 | 0 | 70 | Good |
| Adeno32233Good | 72.4 | 1 | 0 | 80 | Poor |
| Adeno32244Good | 75.4 | 1 | 0 | 100 | Good |
| Adeno32618Good | 78.4 | 1 | 0 | 35 | Good |
| Adeno32605Good | 106 | 1 | 0 | 40 | Good |
| Adeno32713Good | 81.9 | 1 | 0 | ? | Good |
| Adeno32845Good | 76.6 | 1 | 0 | 25 | Poor |
| Adeno32744Good | 93.7 | 1 | 0 | 10 | Good |
| Adeno32731Good | 76 | 1 | 0 | 80 | Good |
| Adeno32708Good | 103 | 1 | 0 | 28 | Good |
| Adeno32140Good | 49 | 1 | 0 | 30 | Good |
| Adeno32034Good | 49.2 | 1 | 0 | 60 | Good |
| Adeno32633Good | 50.1 | 1 | 0 | 100 | Good |
| Adeno32318Good | 50.1 | 1 | 0 | 70 | Good |
| Adeno32103Good | 50.5 | 1 | 0 | 80 | Good |
| Adeno32846Good | 52.9 | 1 | 0 | 65 | Good |
| Adeno32212Good | 54.5 | 1 | 0 | 80 | Poor |
| Adeno32012Good | 56 | 1 | 0 | 95 | Good |
| Adeno32032Good | 56.3 | A | 0 | 90 | Good |
| Adeno32142Good | 56.7 | 1 | 0 | 80 | Good |
| Adeno31613Good | 57.6 | 1 | 0 | 80 | Good |
| Adeno32132Good | 58.5 | 1 | 0 | 90 | Poor |
| Adeno32138Good | 59 | 1 | 0 | 80 | Good |
| Adeno32628Good | 59.3 | 1 | 0 | 30 | Good |
| Adeno32026Good | 66.8 | 4 | 0 | 90 | Good |
| Adeno32614Good | 76.1 | 2 | 0 | 80 | Poor |
| Adeno32706Good | 79 | 2 | 0 | 50 | Good |
| Adeno32031Good | 91 | 2 | 0 | 80 | Poor |
| Adeno32602Good | 71.5 | D | 1 | 100 | Poor |
| Adeno32013Poor | 41.9 | 3 | 1 | 60 | Good |
| Adeno32840Poor | 42.2 | 2 | 0 | 60 | Poor |
| Adeno32252Poor | 45.5 | 2 | 0 | 30 | Poor |
| Adeno32020Poor | 49.6 | 3 | 1 | 60 | Good |
| Adeno32254Poor | 47.2 | 3 | 1 | 60 | Poor |
| Adeno31635Poor | 48.3 | D | 1 | 30 | Good |
| Adeno32309Poor | 48.8 | D | 1 | 90 | Good |
| Adeno32634Poor | 40.5 | 2 | 0 | 90 | Good |
| Adeno31628Poor | 25.3 | 3 | 1 | 40 | Poor |
| Adeno31630Poor | 40.7 | 3 | 1 | 33 | Poor |
| Adeno32005Poor | 7.8 | 3 | 1 | 40 | Poor |
| Adeno32029Poor | 21.8 | 3 | 1 | 70 | Good |
| Adeno32030Poor | 16.5 | 3 | 1 | 60 | Poor |
| Adeno32211Poor | 14.2 | 3 | 1 | 70 | Poor |
| Adeno32248Poor | 8.8 | 3 | 1 | 80 | Poor |
| Adeno32312Poor | 41.2 | 3 | 1 | 80 | Poor |
| Adeno32313Poor | 23.4 | 3 | 1 | 70 | Good |
| Adeno32322Poor | 23.4 | 3 | 1 | 70 | Good |
| Adeno32613Poor | 7.3 | 3 | 1 | 80 | Poor |
| Adeno32735Poor | 20 | 3 | 1 | 70 | Poor |
| Adeno32739Poor | 38.9 | 3 | 1 | 30 | Poor |
| Adeno32746Poor | 8.2 | 3 | 1 | 40 | Poor |
| Adeno32748Poor | 37.6 | 3 | 1 | 95 | Good |
| Adeno32837Poor | 37.9 | 3 | 1 | 60 | Good |
| Adeno32848Poor | 10.5 | 3 | 1 | 30 | Good |

Survival: in months
Status: 1 = alive without disease; 2 = alive with disease; 3 = dead from disease; 4 = dead from other causes; A = alive, disease status unknown; D = dead, reason unknown.
Censor: for Kaplan-Meier analysis (see FIG. 9); 0 = no censoring event; 1 = presence of censoring event
% tumor: in sample on slide
Group: good or poor outcome as predicted by gene ratios

Example 6 Analysis of Gene Expression Data in Various Cancers for Diagnosis and Prognosis

This example represents analyses of gene expression profiling data presented in the literature for several different types of cancer. Each chart has several lists of genes that are increased in expression or decreased in expression in a given diagnosis or prognosis.

The method applied to the analyses of the data uses a combination of ratios of genes, with one set always in the numerator and a second set always in the denominator, to determine diagnosis or prognosis. The genes used in the ratios for determination of diagnosis or prognosis are numbered.

A. Rosenwald et al. (N Engl J Med 346(25):1937-1947, 2002), Diagnosis of subtype germinal-center B-cell-like (GCB) vs type III in diffuse large B-cell lymphoma (DLBCL). Genes having a >2-fold higher expression in different diagnosis samples, and the lowest (best) p values, are shown in Table 17.

Training set=109 samples, Test set=58 samples

TABLE 17 Genes overexpressed in germinal-center B-cell-like (GCB) or type III in diffuse large B-cell lymphoma (DLBCL)

| # | Uniqid | Genes | Accession # |

Overexpressed in GCB
| 1 | 24991 | ~AA825906 | |
| 2 | 24376 | ESTs, Weakly similar to A47224 thyroxine-binding globulin precursor | Hs.317970 |
| 3 | 19384 | MAPK10: mitogen-activated protein kinase 10 | Hs.151051 |
| 4 | 15914 | MAPK10: mitogen-activated protein kinase 10 | Hs.151051 |
| | 29912 | ESTs, Weakly similar to neuronal thread protein [Homo sapiens] [H. sapiens] | Hs.104425 |
| | 20198 | FEM1B: fem-1 homolog b (C. elegans) | AF178632 |
| | 29967 | Homo sapiens cDNA FLJ11170 fis, clone PLACE1007301 | AK002032 |
| | 24971 | KIAA0807 protein | AB018350 |
| | 28472 | MAPK10: mitogen-activated protein kinase 10 | U07620 |
| | 25069 | OSBPL3: oxysterol binding protein-like 3 | AB014604 |

Overexpressed in Type III
| 5 | 30880 | EST | Hs.275766 |
| 6 | 27783 | LY6E: lymphocyte antigen 6 complex, locus E | Hs.77667 |
| 7 | 16430 | PML: promyelocytic leukemia | Hs.89633 |
| 8 | 25001 | FLT3LG: fms-related tyrosine kinase 3 ligand | Hs.428 |
| | 34166 | CLCN7: chloride channel 7 | Hs.80768 |
| | 32671 | ZAP70: zeta-chain (TCR) associated protein kinase (70 kD) | L05148 |
| | 25196 | TRG@: T cell receptor gamma locus | M30894 |
| | 27460 | LC_27460 | |
| | 32673 | CD6: CD6 antigen | X60992 |
| | 27147 | PTPN13: protein tyrosine phosphatase, non-receptor type 13 (APO-1/CD95 (Fas)-associated phosphatase) | D21210 |

B. Welsh et al. (Proc Natl Acad Sci USA 98(3):1176-1181, 2001), Analysis of gene expression profiles in normal and neoplastic ovarian tissue samples identifies candidate molecular markers of epithelial ovarian cancer. Genes having a >2-fold higher expression in ovarian tumor or normal samples, and the lowest (best) p values, are shown in Table 18.

Training set=22 samples, Test set=12 samples

TABLE 18 Genes overexpressed in tumor or normal ovarian epithelium

| # | Probe ID | Gene |

Overexpressed in Tumor
| 1 | X12876_s_at | KRT18: keratin 18 |
| 2 | HG110-HT110_s_at | HNRPAB: heterogeneous nuclear ribonucleoprotein A/B |
| 3 | Y00503_at | KRT19: keratin 19 |
| 4 | M93036_at | TACSTD1: tumor-associated calcium signal transducer 1 |
| | X74929_s_at | KRT8: keratin 8 |
| | HG2815-HT2931_at | MYL6: myosin, light polypeptide 6, alkali, smooth muscle and non-muscle |
| | J02783_at | P4HB: procollagen-proline, 2-oxoglutarate 4-dioxygenase (proline 4-hydroxylase), beta polypeptide (protein disulfide isomerase; thyroid hormone binding protein p55) |
| | X17567_s_at | SNRPB: small nuclear ribonucleoprotein polypeptides B and B1 |
| | L19686_rna1_at | MIF: macrophage migration inhibitory factor (glycosylation-inhibiting factor) |
| | X69699_at | PAX8: paired box gene 8 |

Overexpressed in Normal
| 5 | U24488_s_at | TNXB: tenascin XB |
| 6 | D26155_s_at | SMARCA2: SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily a, member 2 |
| 7 | U17280_at | STAR: steroidogenic acute regulatory protein |
| 8 | X86401_s_at | GATM: glycine amidinotransferase (L-arginine:glycine amidinotransferase) |
| | X63741_s_at | EGR3: early growth response 3 |
| | U90336_at | PEG3: paternally expressed 3 |
| | M21574_at | PDGFRA: platelet-derived growth factor receptor, alpha polypeptide |
| | Z26653_at | LAMA2: laminin, alpha 2 (merosin, congenital muscular dystrophy) |
| | U36922_at | FOXO1A: forkhead box O1A (rhabdomyosarcoma) |
| | M97796_s_at | ID2: inhibitor of DNA binding 2, dominant negative helix-loop-helix protein |

C. Rosenwald et al. (N Engl J Med 346(25):1937-1947, 2002), Diagnosis of subtypes germinal-center B-cell-like (GCB) vs activated B-cell-like (ABC) in diffuse large-B-cell lymphoma (DLBCL). Genes having a >2-fold higher expression in different diagnosis samples, and the lowest (best) p values, are shown in Table 19.

Training set=129 samples, test set=59 samples

TABLE 19 Genes overexpressed in germinal-center B-cell-like (GCB) or activated B-cell-like (ABC) in diffuse large-B-cell lymphoma (DLBCL)

No. | Probe ID | Gene | Accession #

Overexpressed in GCB
1 | 24376 | ESTs, Weakly similar to A47224 thyroxine-binding globulin precursor | Hs.317970
2 | 24480 | LC_24480 | -
3 | 19384 | MAPK10: mitogen-activated protein kinase 10 | Hs.151051
4 | 15914 | MAPK10: mitogen-activated protein kinase 10 | Hs.151051
- | 24991 | AA825906 | -
- | 25126 | LC_25126 | -
- | 34694 | DKFZP434M098 protein | AL117587
- | 19202 | STAG3: stromal antigen 3 | AJ007798
- | 24825 | HDAC1: histone deacetylase 1 | D50405
- | 26725 | EBF: early B-cell factor | AF208502

Overexpressed in ABC
5 | 19375 | DDB1: damage-specific DNA binding protein 1 (127 kD) | Hs.108327
6 | 19346 | SH3BP5: SH3-domain binding protein 5 (BTK-associated) | Hs.109150
7 | 22118 | LC_22118 | -
8 | 27565 | ENTPD1: ectonucleoside triphosphate diphosphohydrolase 1 | Hs.205353
- | 33991 | FOXP1: forkhead box P1 | AF146696
- | 26454 | SH3BP5: SH3-domain binding protein 5 (BTK-associated) | AB005047
- | 16614 | IRF4: interferon regulatory factor 4 | U52682
- | 28536 | ENTPD1: ectonucleoside triphosphate diphosphohydrolase 1 | S73813
- | 31104 | LC_31104 | -
- | 33109 | BLNK: B-cell linker | AF068180
- | 31801 | BMF: Bcl-2 modifying factor | NM_033503

D. Shipp et al. (Nat Med 8(1):68-74, 2002), Diagnosis of diffuse large B-cell lymphoma (DLBCL) vs. follicular lymphoma (FL). Genes having a >2-fold higher expression in different diagnosis samples, and the lowest (best) p values, are shown in Table 20.

Training set=39 samples, Test set=38 samples

TABLE 20 Genes overexpressed in diffuse large B-cell lymphoma (DLBCL) or follicular lymphoma (FL)

No. | Accession # | Gene

Overexpressed in DLBCL
1 | D43950_at | CCT5: chaperonin containing TCP1, subunit 5 (epsilon)
2 | U28386_at | KPNA2: karyopherin alpha 2 (RAG cohort 1, importin alpha 1)
3 | U63743_at | KNSL6: kinesin-like 6 (mitotic centromere-associated kinesin)
4 | X65867_at | ADSL: adenylosuccinate lyase
- | M22960_at | PPGB: protective protein for beta-galactosidase (galactosialidosis)
- | J02783_at | P4HB: procollagen-proline, 2-oxoglutarate 4-dioxygenase (proline 4-hydroxylase), beta polypeptide (protein disulfide isomerase; thyroid hormone binding protein p55)
- | X62078_at | GM2A: GM2 ganglioside activator protein
- | M34079_at | PSMC3: proteasome (prosome, macropain) 26S subunit, ATPase, 3
- | J02645_at | EIF2S1: eukaryotic translation initiation factor 2, subunit 1 (alpha, 35 kD)
- | U23143_at | SHMT2: serine hydroxymethyltransferase 2 (mitochondrial)

Overexpressed in FL
5 | AB002409_at | SCYA21: small inducible cytokine subfamily A (Cys-Cys), member 21
6 | D87119_at | gene with protein product, function unknown
7 | Z11793_at | SEPP1: selenoprotein P, plasma, 1
8 | HG3928-HT4198_at | SFTPA2: surfactant, pulmonary-associated protein A2
- | X91911_s_at | RTVP1: glioma pathogenesis-related protein
- | D50683_at | TGFBR2: transforming growth factor, beta receptor II (70-80 kD)
- | K02777_s_at | TRA@: T cell receptor alpha locus
- | M18255_cds2_s_at | Homo sapiens cDNA FLJ32993 fis, clone THYMU1000103, weakly similar to PROTEIN KINASE C, BETA-I TYPE (EC 2.7.1.-)
- | M12963_s_at | ADH1A: alcohol dehydrogenase 1A (class I), alpha polypeptide
- | HG2239-HT2324_r_at | KCNC3: potassium voltage-gated channel, Shaw-related subfamily, member 3
- | D45370_at | APM2: adipose specific 2

E. Rosenwald et al. (N Engl J Med 346(25):1937-1947, 2002), Diagnosis of subtype activated B-cell-like (ABC) vs type III diffuse large-B-cell lymphoma. Genes having a >2-fold higher expression in different diagnosis samples, and the lowest (best) p values, are shown in Table 21.

Training set=82 samples, test set=43 samples

TABLE 21 Genes overexpressed in subtype activated B-cell-like (ABC) or type III diffuse large-B-cell lymphoma

No. | Probe ID | Gene | Accession #

Overexpressed in ABC
1 | 22122 | IRF4: interferon regulatory factor 4 | Hs.82132
2 | 33991 | FOXP1: forkhead box P1 | Hs.274344
3 | 24899 | PIM1: pim-1 oncogene | Hs.81170
4 | 24416 | PIM1: pim-1 oncogene | Hs.81170
- | 24729 | IRF4: interferon regulatory factor 4 | U52682
- | 16614 | IRF4: interferon regulatory factor 4 | U52682
- | 24701 | ZNFN1A1: zinc finger protein, subfamily 1A, 1 (Ikaros) | U40462
- | 26516 | ESTs, Weakly similar to HERV-E envelope glycoprotein [H. sapiens] | Hs.370685
- | 19348 | ESTs, Weakly similar to HERV-E envelope glycoprotein [H. sapiens] | Hs.370685
- | 19375 | DDB1: damage-specific DNA binding protein 1 (127 kD) | U32986

Overexpressed in Type III
5 | 28060 | EPHB6: EphB6 | Hs.3796
6 | 17533 | EPHB6: EphB6 | Hs.3796
7 | 29998 | CCND1: cyclin D1 (PRAD1: parathyroid adenomatosis 1) | Hs.82932
8 | 27147 | PTPN13: protein tyrosine phosphatase, non-receptor type 13 (APO-1/CD95 (Fas)-associated phosphatase) | Hs.211595
- | 27974 | IGF1: insulin-like growth factor 1 (somatomedin C) | X57025
- | 27857 | CST3: cystatin C (amyloid angiopathy and cerebral hemorrhage) | X05607
- | 29780 | LC_29780 | -
- | 28766 | PTPRM: protein tyrosine phosphatase, receptor type, M | X58288
- | 15930 | CHL1: cell adhesion molecule with homology to L1CAM (close homolog of L1) | AF002246
- | 27460 | LC_27460 | -
- | 34166 | CLCN7: chloride channel 7 | Hs.80768

F. Shipp et al. (Nat Med 8(1):68-74, 2002), Prognosis of diffuse large B-cell lymphoma (DLBCL). Good prognosis was defined by Shipp as no disease recurrence; bad prognosis was defined by Shipp as recurrence of disease. Genes having a >2-fold higher expression in good or poor prognosis samples, and the lowest (best) p values, are shown in Table 22.

Training set, n=29; Test set, n=29

TABLE 22 Genes overexpressed in diffuse large B-cell lymphoma (DLBCL) of good and poor outcome

No. | Accession # | Gene

Overexpressed in Good
1 | L05512_at | HTN1: histatin 1
2 | U73328_at | DLX4: distal-less homeobox 4
3 | Y13247_at | PPP1R10: protein phosphatase 1, regulatory subunit 10
4 | M29277_at | MCAM: melanoma cell adhesion molecule

Overexpressed in Poor
5 | D86969_at | KIAA0215 gene product
6 | L20971_at | PDE4B: phosphodiesterase 4B, cAMP-specific (phosphodiesterase E4 dunce homolog, Drosophila)
7 | M18255_cds2_s_at | Homo sapiens cDNA FLJ32993 fis, clone THYMU1000103, weakly similar to PROTEIN KINASE C, BETA-I TYPE (EC 2.7.1.-)
8 | HG4322-HT4592_at | TUBB: tubulin, beta polypeptide

Example 7 Prognosis of Lung Adenocarcinoma

Data from Bhattacharjee et al. for Stage 1 lung cancer was used as in Example 5, except that: 1) only samples with >50% tumor were used, and 2) a 5-year survival cutoff was used instead of 4-year survival. Thus the criteria for prognosis were: good=alive, survival >60 mos; poor=dead, survival <60 mos. This reduced the sample numbers to n=12 for good prognosis and n=17 for poor prognosis. Genes having a >2-fold higher expression in good or poor prognosis samples, and the lowest (best) p values, are shown in Table 23.

TABLE 23 Genes overexpressed in good or poor outcome

Overexpressed in | LocusLink Symbol | Accession # | Description
Good | FOLR1 | U78793 | folate receptor 1 (adult)
Good | DUSP6 | AB013382 | dual specificity phosphatase 6
Good | SEPP1 | Z11793 | selenoprotein P, plasma, 1
Good | LTF | U95626 | lactotransferrin
Good | KIAA0758 | AB018301 | KIAA0758 protein
Poor | MMP9 | J05070 | matrix metalloproteinase 9 (gelatinase B, 92 kD gelatinase, 92 kD type IV collagenase)
Poor | IGFBP3 | M35878 | insulin-like growth factor binding protein 3
Poor | FN1 | None | Fibronectin, Alt. Splice 1 (TIGR 311_s_at seq.)
Poor | UBCH10 | U73379 | ubiquitin-conjugating enzyme E2C
Poor | UBD | AL031983 | diubiquitin
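The selection criteria described in Example 7 (prognosis labels based on a 60-month cutoff, a >2-fold difference in expression, and ranking by p value) can be illustrated with a short Python sketch. This is not the procedure used to generate Table 23. The function names, the use of group means to compute the fold difference, the toy gene names and values, and the assumption that p values are supplied by a separate statistical test are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def prognosis_label(alive, survival_months, cutoff=60.0):
    """Label a sample per the Example 7 criteria:
    good = alive with survival > cutoff; poor = dead with survival < cutoff.
    Samples meeting neither criterion return None (assumed to be excluded)."""
    if alive and survival_months > cutoff:
        return "good"
    if not alive and survival_months < cutoff:
        return "poor"
    return None


def select_genes(expr_good, expr_poor, p_values, min_fold=2.0, top_n=5):
    """Pick genes with a >min_fold difference in mean expression between
    groups, then keep the top_n genes with the lowest p values per group.

    expr_good / expr_poor: dict mapping gene -> list of expression levels.
    p_values: dict mapping gene -> p value from a test run elsewhere
              (assumption: the statistical test is outside this sketch).
    """
    over_good, over_poor = [], []
    for gene in expr_good:
        m_good = mean(expr_good[gene])
        m_poor = mean(expr_poor[gene])
        if m_poor > 0 and m_good / m_poor > min_fold:
            over_good.append(gene)
        elif m_good > 0 and m_poor / m_good > min_fold:
            over_poor.append(gene)

    def by_p(gene):
        return p_values[gene]

    return sorted(over_good, key=by_p)[:top_n], sorted(over_poor, key=by_p)[:top_n]


# Toy data with hypothetical gene names and values (not from the source).
expr_good = {"GENE_A": [400, 600], "GENE_B": [100, 100],
             "GENE_C": [50, 150], "GENE_D": [900, 1100]}
expr_poor = {"GENE_A": [100, 100], "GENE_B": [300, 500],
             "GENE_C": [100, 120], "GENE_D": [200, 200]}
p_values = {"GENE_A": 0.01, "GENE_B": 0.002, "GENE_C": 0.4, "GENE_D": 0.001}

good_genes, poor_genes = select_genes(expr_good, expr_poor, p_values)
print(good_genes, poor_genes)
```

In this sketch a gene that fails the fold-difference filter is dropped regardless of its p value, which mirrors the two-part criterion stated in the example; other orderings of the two filters are possible.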

The present invention is not limited in scope by the examples provided, since the examples are intended as illustrations of various aspects of the invention and other functionally equivalent embodiments are within the scope of the invention. Various modifications of the invention in addition to those shown and described herein will become apparent to those skilled in the art from the foregoing description and fall within the scope of the appended claims. The advantages and objects of the invention are not necessarily encompassed by each embodiment of the invention.

All references, patents, and patent publications that are recited in this application are incorporated in their entirety herein by reference.

1. A method for diagnosing the presence of cancer cells or non-cancer cells in a tissue sample, comprising providing a set of two or more genes, wherein the set comprises at least one upregulated gene that is expressed in greater amounts in the cancer cells than in corresponding non-cancer cells and at least one downregulated gene that is expressed in lesser amounts in cancer cells than in corresponding non-cancer cells, determining the expression levels of the set of two or more genes, calculating at least one ratio of the expression level of the at least one upregulated gene to the expression level of the at least one downregulated gene, wherein the at least one ratio is indicative of the presence of cancer cells or non-cancer cells in the tissue sample.

2. The method of claim 1, wherein there is at least a 2-fold difference in mean expression levels between the at least one upregulated gene and the at least one downregulated gene.

3. The method of claim 1, wherein two or more expression ratios are calculated.

4. The method of claim 3, further comprising combining the two or more expression ratios.

5. The method of claim 4, wherein the step of combining the two or more expression ratios comprises calculating the geometric mean of the two or more expression ratios.

6. The method of claim 1, wherein the ratio is calculated by division of the expression level of one upregulated gene by the expression level of one downregulated gene.

7. The method of claim 1, wherein the ratio is calculated by division of the expression levels of two or more upregulated genes by the expression level of one downregulated gene.

8. The method of claim 1, wherein the ratio is calculated by division of the expression level of one upregulated gene by the expression levels of two or more downregulated genes.

9. The method of claim 1, wherein the ratio is calculated by division of the expression levels of two or more upregulated genes by the expression levels of two or more downregulated genes.

10. The method of claim 1, further comprising transforming the expression level data for the upregulated and/or downregulated genes prior to calculating the ratio.

11. The method of claim 1, wherein the expression levels are determined by a method selected from the group consisting of nucleic acid hybridization and nucleic acid amplification.

12. The method of claim 11, wherein the nucleic acid hybridization is performed using a solid-phase nucleic acid molecule array.

13. The method of claim 11, wherein the nucleic acid amplification method is real-time PCR.

14. The method of claim 1, wherein the expression levels are determined by an immunological method.

15. The method of claim 14, wherein the immunological method is performed using a solid-phase antibody array.

16. The method of claim 14, wherein the immunological method is an ELISA or ELISPOT assay.

17. The method of claim 1, wherein the cancer is selected from the group consisting of malignant pleural mesothelioma, lung adenocarcinoma, squamous carcinoma, medulloblastoma, prostate cancer, breast cancer, diffuse large B-cell lymphoma, follicular lymphoma and ovarian cancer.

18. The method of claim 1, wherein the at least one ratio is indicative of the presence of cancer cells in the tissue sample.

19. The method of claim 1, wherein the at least one ratio is indicative of the presence of non-cancer cells in the tissue sample.